How to make a simple web crawler in Go

Writing web crawlers is one of the most interesting parts of my job. I love it because it is both simple and complex. It is simple because the basic idea comes down to these four steps:

  1. Send a GET request to the URL.
  2. Extract the content from the response.
  3. Extract any URLs that can be crawled next.
  4. Repeat.
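
The four steps above can be sketched as a single-threaded crawler. This is only a minimal illustration under some assumptions: the names `fetch`, `extractLinks`, `crawl` and `memoryFetcher` are made up for this example, the `a.test` URLs are placeholders, and links are found with a naive regular expression where a real crawler would use a proper HTML parser.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"regexp"
)

// hrefPattern is a deliberately naive way to find links (an assumption made
// for brevity); it only matches double-quoted href attributes.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// fetch covers steps 1 and 2: send a GET request and read the response body.
func fetch(rawURL string) (string, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// extractLinks covers step 3: resolve every href against the page's own URL.
func extractLinks(pageURL, body string) []string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil
	}
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		links = append(links, base.ResolveReference(ref).String())
	}
	return links
}

// crawl covers step 4: repeat until the queue is empty or maxPages is reached.
// It returns the URLs that were fetched successfully, in crawl order.
func crawl(start string, maxPages int, get func(string) (string, error)) []string {
	visited := map[string]bool{}
	queue := []string{start}
	var crawled []string
	for len(queue) > 0 && len(crawled) < maxPages {
		next := queue[0]
		queue = queue[1:]
		if visited[next] {
			continue
		}
		visited[next] = true
		body, err := get(next)
		if err != nil {
			continue // skip pages that cannot be fetched
		}
		crawled = append(crawled, next)
		queue = append(queue, extractLinks(next, body)...)
	}
	return crawled
}

// memoryFetcher is a hypothetical stand-in for fetch that serves pages from a map.
func memoryFetcher(site map[string]string) func(string) (string, error) {
	return func(u string) (string, error) {
		body, ok := site[u]
		if !ok {
			return "", fmt.Errorf("no such page: %s", u)
		}
		return body, nil
	}
}

func main() {
	// A tiny in-memory site (placeholder URLs) so the sketch runs offline.
	// Pass fetch instead of memoryFetcher(site) to crawl a real website.
	site := map[string]string{
		"http://a.test/":  `<a href="/b">b</a> <a href="/c">c</a>`,
		"http://a.test/b": `<a href="/c">c</a>`,
		"http://a.test/c": `<a href="/">home</a>`,
	}
	for _, u := range crawl("http://a.test/", 10, memoryFetcher(site)) {
		fmt.Println(u)
	}
}
```

Passing the fetch function in as a parameter keeps the loop independent of the network, which makes it easy to try out before pointing it at a live site.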

The steps above hold true when we’re crawling on a single thread and sending a single request at a time. Although this works well enough for smaller websites, it would take ages to crawl bigger ones. It also wastes computing power and network capacity, because the program sits idle while it waits for each response. Computers nowadays can send multiple network requests at the same time while processing data and performing I/O operations in parallel.

To illustrate this, think of the above steps as a ‘relay race’ with one race track (the computer) and one team of four runners using just one lane. The first runner finishes sending the ‘GET’ request, the next receives the response and extracts the content, and so on. But doesn’t this seem like we’re wasting the race track? Why not have other teams join the race simultaneously?

Although this example might not fit perfectly with what happens in a real-life multi-threaded application, I believe the basic principles still hold true. We need more ‘teams’ making better use of the race track (the computer).

In computer programs this is done through threads, which, going back to our race track example, one can think of as ‘track lanes’. Just as a track lane is allocated to a given team in a race, a thread is basically a set of computer resources (memory, I/O, etc.) allocated to execute a task without affecting the execution of other tasks.

Most programming languages support multi-threading (i.e. executing tasks in parallel or concurrently); however, this support comes at the cost of code complexity. In most of them, writing multi-threaded applications is cumbersome: it requires lots of testing and debugging, plus a good knowledge of multi-threading issues and how to deal with them. Go, on the other hand, was built with concurrency in mind. It has built-in concurrency features, such as goroutines and channels, that make it well suited to applications like crawlers. I highly recommend you look through the Go language documentation to learn more.
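
To give a feel for those built-in features, here is a small sketch that runs several slow tasks at once, using one goroutine per task and a channel to collect the results. The names `fetchAll`, `result` and `slowGet` are invented for this illustration, and `slowGet` only simulates a network request with a short sleep.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// result carries the outcome of one task back to the collector.
type result struct {
	url  string
	size int
	err  error
}

// fetchAll starts one goroutine per URL and gathers body sizes over a channel.
// URLs whose fetch fails are left out of the returned map.
func fetchAll(urls []string, get func(string) (string, error)) map[string]int {
	results := make(chan result)
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			body, err := get(u)
			results <- result{url: u, size: len(body), err: err}
		}(u)
	}
	// Close the channel once every goroutine has reported its result.
	go func() {
		wg.Wait()
		close(results)
	}()
	sizes := map[string]int{}
	for r := range results {
		if r.err == nil {
			sizes[r.url] = r.size
		}
	}
	return sizes
}

// slowGet is a hypothetical stand-in for an HTTP GET: it sleeps briefly and
// returns a fake body, or an error when the URL is empty.
func slowGet(u string) (string, error) {
	time.Sleep(10 * time.Millisecond)
	if u == "" {
		return "", errors.New("empty URL")
	}
	return "body of " + u, nil
}

func main() {
	start := time.Now()
	sizes := fetchAll([]string{"a", "bb", "ccc"}, slowGet)
	// The three simulated requests overlap, so the total time stays close
	// to the duration of one request instead of the sum of all three.
	fmt.Println(len(sizes), "pages fetched in", time.Since(start))
}
```

In relay race terms, each goroutine is another team on the track, and the channel is where the teams hand in their results.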

Without going further into the technical detail, I’ll show the code to help get you started when writing your own Go crawler.
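
The listing below is not the original code from this post. It is a minimal sketch of how a concurrent crawler could look in Go, written under a few assumptions: the names `crawlConcurrent`, `linkFetcher`, `page` and `memorySite` are invented for this example, the site is an in-memory map with placeholder page names, and the fetch function is assumed to return the links found on a page. Pages are crawled level by level, and a buffered channel limits how many requests run at the same time.

```go
package main

import (
	"fmt"
	"sync"
)

// linkFetcher fetches one page and returns the links found on it. In a real
// crawler this would send a GET request and parse the response.
type linkFetcher func(url string) ([]string, error)

// page holds the outcome of fetching a single URL.
type page struct {
	url   string
	links []string
	err   error
}

// crawlConcurrent crawls breadth-first from start, following links up to
// maxDepth levels away, with at most workers fetches in flight at once.
// It returns the successfully fetched URLs in a deterministic order.
func crawlConcurrent(start string, maxDepth, workers int, get linkFetcher) []string {
	if workers < 1 {
		workers = 1
	}
	visited := map[string]bool{start: true}
	frontier := []string{start}
	var crawled []string

	for depth := 0; depth <= maxDepth && len(frontier) > 0; depth++ {
		pages := make([]page, len(frontier))
		sem := make(chan struct{}, workers) // limits concurrent fetches
		var wg sync.WaitGroup

		for i, u := range frontier {
			wg.Add(1)
			go func(i int, u string) {
				defer wg.Done()
				sem <- struct{}{}        // acquire a slot
				defer func() { <-sem }() // release it when done
				links, err := get(u)
				pages[i] = page{url: u, links: links, err: err} // each goroutine owns its index
			}(i, u)
		}
		wg.Wait()

		// Collect unseen links on a single goroutine, so the visited map
		// needs no locking.
		var next []string
		for _, p := range pages {
			if p.err != nil {
				continue
			}
			crawled = append(crawled, p.url)
			for _, l := range p.links {
				if !visited[l] {
					visited[l] = true
					next = append(next, l)
				}
			}
		}
		frontier = next
	}
	return crawled
}

// memorySite is a hypothetical linkFetcher backed by a map of page to links.
func memorySite(site map[string][]string) linkFetcher {
	return func(url string) ([]string, error) {
		links, ok := site[url]
		if !ok {
			return nil, fmt.Errorf("no such page: %s", url)
		}
		return links, nil
	}
}

func main() {
	// Placeholder site: a links to b and c, b links to c and d, c links back to a.
	site := map[string][]string{
		"a": {"b", "c"},
		"b": {"c", "d"},
		"c": {"a"},
		"d": nil,
	}
	for _, u := range crawlConcurrent("a", 5, 2, memorySite(site)) {
		fmt.Println(u)
	}
}
```

Crawling one level at a time keeps the bookkeeping simple, at the cost of waiting for the slowest page in each level. A long-running crawler might prefer a shared work queue fed by a fixed pool of workers.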

 

Happy crawling! Let us know how you get on.

6 thoughts on “How to make a simple web crawler in Go”

  1. MINHAJ UDDIN

    Hi Suhail,

    Yes, i am trying to find an easy way to make web crawler!
    Definitely this article is very helpful for me.
    Very detailed article.
    Most thanks for showing the code.
    Keep it up!
    Have a nice day!
