How to make a simple web crawler in GO

Writing web crawlers is one of the most interesting parts of my job. The reason I love it is that it is both simple and complex. Simple because the basic idea is as easy as following these four steps:

  1. Send a GET request to the URL.
  2. Extract the content from the response.
  3. Extract any URLs that can be crawled next.
  4. Repeat.

The steps above hold true when we’re crawling on a single thread, sending a single request at a time. The problem is that although this works well for smaller websites, it would take ages to crawl bigger ones, not to mention that it would waste computing power, network resources, and I/O capacity. Modern computers can perform multiple network requests at the same time, processing data and performing I/O operations in parallel.
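
To make those four steps concrete before we add any concurrency, here is a minimal single-threaded sketch of my own (not the crawler built later in this post); the starting URL and the crude link-matching regular expression are placeholders you would refine for real use.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
)

//a deliberately simplistic link matcher; a real crawler would use an HTML parser
var linkRe = regexp.MustCompile(`href="(http[^"]*)"`)

//crawlOnce performs steps 1-3: request the URL, read the body, extract links
func crawlOnce(target string) []string {
	resp, err := http.Get(target) //step 1: send a GET request
	if err != nil {
		return nil
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body) //step 2: extract the content
	if err != nil {
		return nil
	}
	var next []string
	for _, m := range linkRe.FindAllStringSubmatch(string(body), -1) {
		next = append(next, m[1]) //step 3: collect URLs to crawl next
	}
	return next
}

func main() {
	visited := map[string]bool{}
	queue := []string{"http://example.com"} //hypothetical starting point
	for len(queue) > 0 { //step 4: repeat
		u := queue[0]
		queue = queue[1:]
		if visited[u] {
			continue
		}
		visited[u] = true
		fmt.Println("crawled:", u)
		queue = append(queue, crawlOnce(u)...)
	}
}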

To illustrate this, think of the above steps as a ‘relay race’ with one race track (the computer) and one team of four runners using just one lane. The first runner finishes sending the GET request, the next receives the response and extracts the content, and so on. But doesn’t this seem like we’re wasting the race track? Why not have other teams join the race simultaneously?

While this example might not map perfectly onto what happens in a real-life multi-threaded application, I believe the basic principles still hold true: we need more ‘teams’ making better use of the race track (the computer).

In computer programs this is done through threads, which, going back to our race track example, one can think of as ‘track lanes’. Just as a track lane is allocated to a given team in a race, a thread is basically a set of computer resources (memory, I/O, etc.) allocated to execute a task without affecting the execution of other tasks.

Most programming languages support multi-threading (i.e. executing tasks in parallel or concurrently); however, this support comes at the cost of code complexity. In most languages, writing multi-threaded applications is cumbersome: it requires lots of testing and debugging, plus a good knowledge of multi-threading issues and how to deal with them. ‘Go’, on the other hand, was built with concurrency in mind. It differs from other languages in that it has built-in concurrency features that make it ideal for this kind of application. I highly recommend you look through the Go language documentation to learn more.
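
To get a feel for those features before we look at the crawler, here is a minimal sketch of my own (separate from the crawler below) showing the two building blocks it relies on: goroutines, lightweight threads started with the go keyword, and channels, which let goroutines exchange values safely.

package main

import "fmt"

func main() {
	//a channel through which the workers report back
	results := make(chan string)

	//launch three concurrent workers, each in its own goroutine
	for i := 1; i <= 3; i++ {
		go func(id int) {
			results <- fmt.Sprintf("worker %d done", id)
		}(i)
	}

	//receive one message per worker; <- blocks until a value arrives
	for i := 0; i < 3; i++ {
		fmt.Println(<-results)
	}
}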

Without going further into the technical details, I’ll show the code to help get you started writing your own Go crawler.

package main

//import packages
import (
	"fmt"        //formatted I/O (printing to the console)
	"io/ioutil"  //reading data from input/output streams
	"net/http"   //for sending HTTP requests
	"net/url"    //for URL parsing and resolution
	"regexp"     //regular expressions
	"runtime"    //GO runtime (used to set the number of threads to be used)
	"strings"    //string manipulation and testing
)

//how many threads to use within the application
const NCPU = 8

//URL filter function definition
type filterFunc func(string, Crawler) bool

//Our crawler structure definition
type Crawler struct {
	//the base URL of the website being crawled
	host string
	//a channel on which the crawler will receive new (unfiltered) URLs to crawl
	//the crawler will pass everything received from this channel
	//through the chain of filters we have
	//and only allowed URLs will be passed to the filteredUrls channel
	urls chan string
	//a channel on which the crawler will receive filtered URLs.
	filteredUrls chan string
	//a slice that contains the filters we want to apply on the URLs.
	filters []filterFunc
	//a regular expression pointer to the RegExp that will be used to extract the
	//URLs from each request.
	re *regexp.Regexp
	//an integer to track how many URLs have been crawled
	count int
}

//starts the crawler
//the method starts two GO routines:
//the first one waits for new URLs as they
//get extracted,
//the second waits for filtered URLs as they
//pass through all the registered filters
func (crawler *Crawler) start() {
	//wait for new URLs to be extracted and passed to the URLs channel
	go func() {
		for n := range crawler.urls {
			//filter the URL
			go crawler.filter(n)
		}
	}()

	//wait for filtered URLs to arrive through the filteredUrls channel
	go func() {
		for s := range crawler.filteredUrls {
			//print the newly received filtered URL
			fmt.Println("Crawling:", s)
			//increment and print the crawl count
			crawler.count++
			fmt.Println("Crawled so far:", crawler.count)
			//start a new GO routine to crawl the filtered URL
			go crawler.crawl(s)
		}
	}()
}

//given a URL, the method will send an HTTP GET request
//extract the response body
//extract the URLs from the body
func (crawler *Crawler) crawl(url string) {
	//send the HTTP request
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("An error has occurred:", err)
		return
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("A read error has occurred:", err)
		return
	}
	crawler.extractUrls(url, string(body))
}

//adds a new URL filter to the crawler
func (crawler *Crawler) addFilter(filter filterFunc) *Crawler {
	crawler.filters = append(crawler.filters, filter)
	return crawler
}

//stops the crawler by closing both the URLs channel
//and the filtered URLs channel
func (crawler *Crawler) stop() {
	close(crawler.urls)
	close(crawler.filteredUrls)
}

//given a URL, the method will apply all the filters
//to that URL; if, and only if, it passes through all
//the filters will it be passed to the filteredUrls channel
func (crawler *Crawler) filter(url string) {
	for _, fn := range crawler.filters {
		if !fn(url, *crawler) {
			//the URL failed one of the filters, discard it
			return
		}
	}
	crawler.filteredUrls <- url
}

//given the crawled URL and its body, the method
//will extract the URLs from the body and generate
//absolute URLs to be crawled next;
//the extracted URLs will be passed to the URLs channel
func (crawler *Crawler) extractUrls(Url, body string) {
	newUrls := crawler.re.FindAllStringSubmatch(body, -1)
	baseUrl, _ := url.Parse(Url)
	for _, z := range newUrls {
		u := z[1]
		//protocol-relative URLs start with "//"
		if strings.HasPrefix(u, "//") {
			crawler.urls <- "http:" + u
			continue
		}
		ur, err := url.Parse(u)
		if err != nil {
			continue
		}
		if ur.IsAbs() {
			//absolute URLs can be crawled as they are
			crawler.urls <- u
		} else {
			//relative URLs are resolved against the base URL
			crawler.urls <- baseUrl.ResolveReference(ur).String()
		}
	}
}

func main() {
	//set how many processes (threads) to use
	runtime.GOMAXPROCS(NCPU)

	//create a new instance of the crawler structure
	c := Crawler{
		host:         "http://example.com", //replace with the site you want to crawl
		urls:         make(chan string),
		filteredUrls: make(chan string),
		filters:      make([]filterFunc, 0),
		re:           regexp.MustCompile(`(?s)<a[ \t]+.*?href="(http.*?)".*?>.*?</a>`),
	}

	//add our only filter, which makes sure that we are only
	//crawling internal URLs
	c.addFilter(func(Url string, crawler Crawler) bool {
		return strings.Contains(Url, crawler.host)
	})

	//start the crawler and seed it with the host URL
	c.start()
	c.urls <- c.host

	//block until the user presses Enter, then stop the crawler
	var input string
	fmt.Scanln(&input)
	c.stop()
}

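If you want to try it out, you could save the listing as, say, crawler.go (any file name will do), point the host field at a site you have permission to crawl, and run:

go run crawler.go

Crawled URLs will be printed as they arrive, and pressing Enter stops the crawler.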

Happy crawling! Let us know how you get on.

How to Maximise Your PPC Seller Ratings (Post 04/09/11 Update)

Nothing Lasts Forever – this should be the slogan for Google and their regular updates. A couple of weeks ago Google introduced an update to the seller rating extensions. Is Google just playing with us, and what is the true motive for this update? Get your notepads out and listen carefully:

Let’s just turn back the wheels of time to the year 2009, when Google introduced rich snippets in order to enrich search results describing people or containing reviews. Since then, rich snippets have been extended with further types, such as products, events, recipes and many more.

Rich Snippets

The seller rating extension rolled out two years ago and helped customers identify highly rated merchants when searching for goods or services on Google. Your merchant star rating from Google Product Search was attached to your AdWords ads. These star ratings were aggregated from review sites (Trustpilot, Bizrate, PriceGrabber) from all around the web, allowing people to find merchants that are highly recommended by other online shoppers.
Seller Rating Extensions
In order to qualify for this extension in the PAST, the seller needed 30 reviews over the history of their account and an average 4-star rating from those reviews. Moreover, if someone clicked on your review link, the click was free! Okay, not everything was free, though – clicks on the headline were still charged!

Back to the actual update. Most things have remained the same; clicks to your review are still free and you have to meet their requirements in order to use this feature. However, the requirements have been slightly updated.

The new qualifications require the seller to have at least 30 reviews over the most recent 12-month period, while maintaining a 4-star average rating.

Doesn’t sound too bad, does it?

Looking at it from a seller’s point of view, the biggest impact will obviously be felt by merchants who haven’t paid enough attention to receiving reviews consistently. The main draw for online shoppers was avoiding long queues and all the people they would otherwise have to deal with. However, when this extension was first introduced, all sellers were focused on getting their customers to review their site, prices and quality. Once a seller received 30 reviews, the ratings would populate automatically. Hence, Google might be urging all merchants to get those reviews rolling in again. So get hold of your creative advertisers and start planning an effective strategy.

How do you get more reviews from your customers?

There are many ways to increase your review count, but the real work lies in maintaining your seller rating extension by receiving reviews regularly. Bad luck for those who have only been focusing on reaching their first 30 reviews. You might consider implementing reminder links on the final checkout page, or you could offer coupons in return for a review – a nice incentive for customers.

Moreover, if you saw your CTRs and/or conversions increase after the introduction of seller ratings, you may experience a sudden decrease once your reviews and ratings disappear. Consumers may be more hesitant to click on an ad without that extension, which ultimately means the loss of a prospective review.

What’s the point in getting those seller ratings back?

According to Google, merchants with seller ratings can see an increase of 17% in click-through rate. Ha! – more clicks always mean more sales, don’t they? That only holds if you spend time on conversion rate optimisation and analyse your checkout procedure.

Furthermore, hasn’t it always been good business practice, and a primary goal of any company, to receive continuous feedback? You have to adapt to market trends and current consumer tastes in order to maximise sales, and you’ll only get that data if you get those reviews.

Points to remember

1) Implementing the seller rating extensions can bring a possible 17% CTR increase

2) You need to make sure your customers are leaving reviews after they check out

3) Keep an eye on conversion rate optimisation to capture that 17% increase!

In the end, you only need to ask your customers politely for reviews AND keep an average star rating of 4, which reflects good customer service and experience. Then you will be able to receive that all-important 17% increase in click-through rate as well!