Cloud Computing and SEO
Most of us can’t imagine our daily lives without the usual ‘clicks’ on Facebook or Twitter, or without ‘Googling’ tricky and rare words to get answers from all over the world. Every business runs day to day on data analysis. Without a consistent, significant data set, continuously accumulated and updated to reflect the latest user trends, we could not make any business-related decision.
But because all this data arrives almost instantly, every message, query and page view across countless web pages, we rarely think about the importance of those records, their sheer quantity, or what it actually takes to gather them, process them and draw conclusions.
Google claimed that in 2004 it was processing 100 TB of data a day. By 2008 the daily volume had reached 20 PB, a 200-fold increase. In 2009, eBay stated that its data was growing by 150 billion new records per day. Processing this amount of data every day and generating insights in reasonable time stimulated research into machine learning, data mining and predictive-analytics algorithms.
Since this stopped being a task for a single machine, Web 2.0 was re-branded as web services, or Cloud services. Why Cloud? A single processor handling the query ‘Restaurant in London’ in Google, calculating PageRank by going through millions of links, or traversing the Facebook Graph API to suggest a friend you ‘might know’ would make every request a very time-consuming process.
So why are these so-called ‘Cloud technologies’ so magically powerful that they let us dramatically reduce average algorithm processing times? They allow all the ‘Big Data’ to be split into smaller data sets whose processing is executed in parallel on multiple clusters. Just as humans are more productive when we split a job among us, assigning a specific task to everyone, in cloud computing parallel hardware and software speed up computation through parallelisation. Assuming we have N machines, the speed-up from parallelisation can be defined as:
S(N) = T(1)/T(N),
where T(1) is the execution time of the sequential computation and T(N) is the execution time on N parallel tasks.
In real life, though, the situation is not as perfect as the formula suggests: some fraction of the task cannot be parallelised, and scheduling, load balancing between processors and communication costs all grow as processors are added. As a result, the speed-up tends to flatten out, growing roughly logarithmically with the number of machines, but it is still a very good achievement in time reduction.
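The diminishing returns described above are captured by Amdahl’s law, which bounds S(N) by the fraction of the task that can be parallelised. A minimal sketch (the 95% parallelisable fraction is an illustrative assumption, not a figure from the text):

```python
def amdahl_speedup(p, n):
    """Speed-up S(N) under Amdahl's law, where p is the fraction of
    the task that can be parallelised and n the number of machines."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% of the work parallelisable, adding machines quickly
# shows diminishing returns: the speed-up can never exceed 1/(1-p) = 20.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Even with a thousand machines, the 5% sequential fraction caps the speed-up at just under 20x, which is why the curve flattens rather than growing linearly.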
A typical algorithm that splits a problem into sub-problems is MapReduce. Originally invented by Google, it was later generalised and spread across multiple platforms, such as the open-source Apache Hadoop project and Elastic MapReduce (EMR) on Amazon Web Services. MapReduce distributes data processing across multiple clusters, then gathers the intermediate results and returns the final answer. Google runs thousands of MapReduce programs on thousands of machines every day.
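The map–shuffle–reduce pattern can be sketched in a few lines. This toy single-machine word count only illustrates the structure; on Hadoop or EMR each phase would run distributed across many machines, and the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: each document independently emits (word, 1) pairs.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["restaurant in London", "best restaurant London"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'restaurant': 2, 'in': 1, 'london': 2, 'best': 1}
```

Because every map call and every reduce call is independent, the framework can scatter them across a cluster and only the shuffle step needs to move data between machines.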
And finally, what does this have to do with SEO?
This approach is widely used in calculating the number-one SEO metric: PageRank. PageRank is a probability distribution over the nodes of the link graph, representing the likelihood that a random walk along links will bring you to a particular node; in other words, it measures how frequently a web page would be encountered by surfing the World Wide Web at random. The program has to iterate through millions of links to measure the importance of each one and compute the actual PageRank of every webpage. The more incoming links a page has, the higher the probability of reaching it from other pages.
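The random-walk idea above can be sketched as a small power iteration over a toy link graph (the page names, 0.85 damping factor and fixed iteration count are illustrative assumptions; a real computation runs over millions of links on a cluster):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links.

    `links` maps each page to the list of pages it links to.
    Every page keeps a (1 - damping) share of 'random jump' probability;
    the rest flows in from the pages that link to it.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Tiny illustrative graph: "home" has the most incoming links.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Even in this three-page graph, the page with the most incoming links ends up with the highest rank, matching the intuition that more inbound links mean a higher probability of arriving there by random surfing.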