
How to Crawl An Entire Industry Faster Than Most Crawl A Single Site

James Finlayson

March 21, 2018

We’ve previously talked about running Screaming Frog’s Crawler on Google’s Compute Cloud. Now, I want to share how we took this to the next level – how you can use it to automatically crawl as many websites as you want simultaneously.

Running Screaming Frog In the Cloud

As a quick reminder, the advantages of running Screaming Frog in the cloud include:

  1. As it’s not running locally, it doesn’t slow your computer down, so you’re free to get on with other work whilst it’s running.
  2. For the same reason, it doesn’t slow your internet connection down – something your colleagues will thank you for, I’m sure.
  3. It’s fast. As it uses the internet connection attached to Google’s Compute Cloud, it can potentially crawl much faster than your office internet connection allows. We’ve seen speeds of over 1.5k URLs per second.
  4. You can run it on a machine with (practically) any amount of RAM, hard-disk space and processors you want, allowing you to scale up to crawl the largest websites.
  5. It’s cheap; a single virtual machine with 30GB of RAM will cost you less than $0.30 an hour to run, and you only run it when you need it.

Why Command A Screaming Army?

Technical SEO has become increasingly complex. As a result, we rely on our technical SEOs to have an increasingly deep understanding of the subject to be able to find insights. Perhaps more importantly, though, those insights need to be communicated to people who are often not SEO-savvy, in a way that’s compelling – that’ll inspire action. What almost all businesses care about is their competition, so it’s very common, when discussing links, to compare a company against its competitors.

We do this to give meaning to numbers:

[Image: DTOX Risk comparison]

Telling someone that their website has a DTOXRisk of 156 is likely meaningless; telling them that their link portfolio has ‘less than half the industry-average risk of a penalty’ is immediately accessible.

We provide industry comparisons to show the depth of the problem or opportunity:

[Image: anchor-text comparison]

Here, instead of saying that 15% of your anchor text uses commercial terms, we might comment that the analysed site has 4x the commercial anchor text of the industry average – that they’d need to increase the size of their link portfolio by 11% with purely branded anchor text just to get back to the industry average.

As almost every company has that one competitor it really, really hates, we find that, when presenting to the C-suite, comparing directly with that one competitor can often yield the fastest results:

[Image: link competitor comparison]

Something strange happens when we start to discuss technical SEO, though. We start showing relatively complex formulas to explain how fixing a canonical issue, for example, might influence total revenue generated. We ditch the competitor comparisons and don’t show graphs like these:

[Image: technical SEO comparison]

If we’re honest, the reason we don’t create graphs like these is not that they’re ineffective, but that you’d have to crawl an entire industry to build them. Crawling an entire industry would either take a prohibitive amount of time, if using Screaming Frog, or be expensive, if using a SaaS crawler like Deepcrawl.

What if, instead, you could run multiple machines at the same time, each running its own copy of Screaming Frog? That way, you could simultaneously crawl every site in an industry. Typically, this isn’t an option because they’d be fighting with each other for network bandwidth and putting in an order to your boss for 10 laptops is unlikely to get the green light. If you use Google Compute Cloud, though, it suddenly becomes possible.

In the next section, I’m going to explain how to set up a system in which you feed in a series of websites and run some scripts, and, in the background, it initiates multiple virtual machines that are each running Screaming Frog Crawler and are each allocated a website from your list to start crawling. This makes it possible to crawl an entire industry faster than you could typically crawl one website using just the computer that you’re reading this article on.

Side-note: did you know the collective noun for a group of frogs is an army of frogs? I didn’t…

Pre-requisites: You’ll need three things before you start:

  1. A Google account.
  2. A debit or credit card that Google can charge for use of their servers (you’ll be able to use them for free for a while, but you have to enter card details immediately in any case).
  3. One Screaming Frog license for each machine you intend to run simultaneously. That means, if you intend to crawl 10 websites at the same time, you need to have at least 10 licenses. Screaming Frog is inexpensive – support its continued development.

Step 1: Creating A Virtual Machine

You start by creating a single virtual machine on Google Compute Cloud that’s running everything you’ll need to crawl a single site. I’ve written up how to do this here so pop over to that article, get set up and then return here for the next step.

NOTE: Whilst you’re getting set up, note down the username shown in the terminal (the name before the @ on each line) – you’ll need it later on, and now is the most convenient time to grab it.

Step 2: Connecting Your VM to the Outside World

Welcome back. You should now have a virtual machine running Screaming Frog Crawler and Chrome. Now, we need to create a way to automagically control that Virtual Machine. Luckily, it’s pretty simple:

  1. VNC into the Virtual Machine
  2. Open Google Chrome (it’ll be under ‘Applications’ > ‘Internet’ and will have been installed via one of the scripts you ran previously).
  3. Load up this post in Google Chrome and download screaming-frog-auto-start.sh by clicking here. Save it to your virtual machine’s desktop.
  4. Open ‘Applications’ > ‘Settings’ > ‘Session and Startup’ and click on the ‘Application Autostart’ tab.
  5. Click ‘Add’, then the folder icon, choosing the ‘/’ in the left-hand box.
  6. Browse to ‘usr’, then ‘bin’ and select ‘google-chrome’ and press OK
  7. Name it ‘Chrome’ and then click OK. (you’ve just set Google Chrome to auto-start as you’ll almost certainly open it up every time to save the output of Screaming Frog in any case)
  8. Click ‘Add’ again, then the folder icon, choosing ‘Desktop’ this time and selecting the script you previously downloaded.
  9. Click OK, name it anything you like (I went with ‘screaming-start’) and click OK again.
  10. Then click ‘Close’ and you’re done.

With these steps, you’ve set Linux to open Chrome on startup and to run a script that pulls the metadata set for that machine (we’ll set that metadata in another script), reads from it the URL of the site to crawl, and then starts Screaming Frog with an instruction to crawl that site.
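
If you’re curious what that auto-start step boils down to, here’s a minimal sketch of the same idea in Python (the file we provide is a shell script, but the logic is identical). The metadata key name (‘crawl-url’) and the way the Screaming Frog launcher accepts a URL are assumptions – match them to your own setup and version. The metadata endpoint and the Metadata-Flavor header, though, are standard on any Compute Engine VM.

```python
# Minimal sketch (Python 2.7, matching the rest of this guide) of what
# screaming-frog-auto-start.sh does conceptually -- not the actual script.
import subprocess
import urllib2

# Standard Compute Engine metadata endpoint; 'crawl-url' is an assumed
# key name -- use whatever key your orchestration script actually sets.
METADATA_URL = ('http://metadata.google.internal/computeMetadata/v1/'
                'instance/attributes/crawl-url')

request = urllib2.Request(METADATA_URL, headers={'Metadata-Flavor': 'Google'})
site_to_crawl = urllib2.urlopen(request).read().strip()

# Hand the URL to Screaming Frog. Whether the launcher takes the URL
# directly or needs a flag depends on your version -- check its docs.
subprocess.call(['screamingfrogseospider', site_to_crawl])
```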

Step 3: Set Up Screaming Frog

Currently, Screaming Frog has all the default options and doesn’t even have a license key entered. You’ll find it under ‘Applications’ > ‘Internet’. Load it up, enter a license key and configure the settings how you like them. As the internet connection is so good – and you don’t have to worry about slowing it down for your colleagues – I typically set it to crawl with a maximum of 100 threads. Be wary of the type of sites you’re crawling, though: that would be enough to take down many smaller sites, which is not what you’re trying to achieve! When you have the settings how you like them, close Screaming Frog and close the Virtual Machine window.

Pop into Google Cloud Console and stop the instance, so you’re not charged for it sitting idle.

Step 4: Set Up the Virtual Machine as a Snapshot

Your virtual machine is all set up, but now we have to make it easily reproducible. We do this by creating a Snapshot. Snapshots are also compressed, and so are cheaper to store than Virtual Machines themselves. Here’s how (there’s also a command-line alternative, sketched after the list):

  1. Log in to Google Compute Cloud Console and, from the left-hand menu, select ‘Snapshots’.
  2. Click ‘Create Snapshot’
  3. Name it ‘screaming-snapshot’ and then select whatever you called the virtual machine you’ve been working from thus far from the ‘Source disk’ menu.
  4. Click ‘Create’
  5. You can now click back into ‘VM Instances’ and delete your virtual machine – you’ve backed it up in the previous step.
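
If you’d rather script this step than click through the console, the gcloud CLI can take the snapshot for you. A minimal sketch, assuming gcloud is installed and authenticated, and that your VM’s boot disk shares the VM’s name – ‘screaming-vm’ and the zone are placeholders to swap for your own:

```python
# Command-line alternative to Step 4, driven from Python via the gcloud
# CLI (assumes gcloud is installed and you're authenticated).
import subprocess

subprocess.check_call([
    'gcloud', 'compute', 'disks', 'snapshot', 'screaming-vm',  # boot disk name
    '--snapshot-names', 'screaming-snapshot',
    '--zone', 'europe-west1-b',  # placeholder -- use your VM's zone
])
```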

Step 5: Setting Up Python

The script that automates everything for you is written in a programming language called Python. It’s great; I’m terrible at it. Feel free to look at my rag-tag pieces of code if you’d like a comprehensive lesson in how not to implement best practice and, generally, want to amuse yourself.

If you’ve not used Python before on the computer you’re on then follow these guides to get set up:

NOTE: The guides above will take you through installing Python 2.7 rather than the latest version, Python 3. For historical reasons we use Python 2.7 – I’m sure that with a few changes you could get the script working in Python 3 too. If you don’t know what the differences between 2.7 and 3 are, then please ignore this segue entirely.

Step 6: Download and Edit the Scripts

You now have a virtual machine template that, when booted, will open VNC Server on a port you tell it, open Screaming Frog and begin crawling a site of your choice. Now, we need to create a script that automatically creates copies of that virtual machine template, boots them and provides them with the right details to get on with their work.

  1. Create a folder on your computer where the scripts can live
  2. Create a text file called ‘sites-to-crawl.txt’. The text file should contain the absolute URLs of the sites you want to crawl at that time, with each site on a new line (the sketch after this list shows the format).
  3. Next, we’ll be saving and editing the Python code that pulls everything together. Download our template files here, and here, saving them in the same directory you saved the sites-to-crawl.txt file.
  4. Once downloaded, open the files in your favourite editor (I like Sublime; our devs like Visual Studio – though you could just use TextEdit if you don’t want to waste time installing another thing).
  5. Search the files for [[[ ]]] sections. These should be replaced (including the brackets) with inputs that are specific to your setup. We’ve explained within the brackets of each what’s required.
  6. Now download terminal.py from here (more info on this awesomely useful script here) and save it in the same directory.
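
To make the moving parts concrete, here’s a heavily simplified sketch of the kind of loop the orchestration script runs – not the actual scream.py. It assumes gcloud is installed and authenticated; the instance names, the zone and the ‘crawl-url’ metadata key are all illustrative choices that need to match your auto-start script.

```python
# Simplified sketch of the orchestration loop -- not the real scream.py.
import subprocess

ZONE = 'europe-west1-b'  # placeholder -- match your snapshot's zone

# sites-to-crawl.txt holds one absolute URL per line, e.g.:
#   https://www.example.com/
#   https://www.example.org/
with open('sites-to-crawl.txt') as f:
    sites = [line.strip() for line in f if line.strip()]

for i, site in enumerate(sites, start=1):
    name = 'screamer-%d' % i
    # Clone the snapshot into a fresh boot disk for this instance...
    subprocess.check_call([
        'gcloud', 'compute', 'disks', 'create', name,
        '--source-snapshot', 'screaming-snapshot', '--zone', ZONE])
    # ...then boot a VM from it, passing the site to crawl as metadata
    # for the auto-start script to pick up.
    subprocess.check_call([
        'gcloud', 'compute', 'instances', 'create', name,
        '--disk', 'name=%s,boot=yes,auto-delete=yes' % name,
        '--metadata', 'crawl-url=%s' % site,
        '--zone', ZONE])
```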

Step 7: Setting Up VNC Viewer

In the guide to setting up your original virtual machine, you will have set up VNC Viewer to connect to one virtual machine at 127.0.0.1:5901. You now need to add connections for as many virtual machines as you think you might create.

  1. Open VNC Viewer
  2. Right-click on your existing VNC connection and choose ‘Duplicate’.
  3. Repeat the step above until you have as many connections as the number of sites you think you might want to crawl simultaneously.
  4. Now right click on each new connection that you created and choose ‘Properties’.
  5. Change the port number (the four-digit number in the text box next to ‘VNC Server’) to match the order of the virtual machine it connects to – e.g. your original connection should be 5901, the next 5902, the next 5903 and so on (the sketch after this list shows how those local ports reach the VMs).
  6. Change the name given to that connection to whatever you want – I’m boring and so use ‘1’, ‘2’, ‘3’ etc, but feel free to name them after dinosaurs, power rangers or wizarding houses if you really want to. I won’t judge you. Well, maybe a little.
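
Each of those 127.0.0.1:590X connections reaches its VM through an SSH tunnel that forwards a local port to the VNC server (port 5901) on that machine. Here’s a sketch of opening one tunnel per instance, assuming instances named screamer-1, screamer-2 and so on as in the earlier sketch – gcloud manages the SSH keys for you (the first run may prompt you to create them):

```python
# One SSH tunnel per VM: local port 5901 -> screamer-1, 5902 -> screamer-2...
import subprocess

ZONE = 'europe-west1-b'   # placeholder -- match your instances' zone
NUM_INSTANCES = 3         # however many VMs you spun up

for i in range(1, NUM_INSTANCES + 1):
    # Everything after '--' is passed straight to ssh: -L forwards the
    # local port to the VM's VNC server; -N keeps the tunnel open
    # without running a remote command.
    subprocess.Popen([
        'gcloud', 'compute', 'ssh', 'screamer-%d' % i, '--zone', ZONE,
        '--', '-L', '%d:localhost:5901' % (5900 + i), '-N'])
```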

Step 8: In Which the Magic Happens

You’re now all set to go!

  1. List all the sites you want to crawl in sites-to-crawl.txt, saving and closing the file.
  2. Open up a Terminal (or Command Prompt) window and change directory to the folder containing the Python code (using ‘cd’ followed by the name of the next layer of directory).
  3. When in the correct folder, type ‘python scream.py sites-to-crawl.txt’
  4. The code will now begin to run – don’t worry about any errors about hard-drives being small as, in practice, it’s not a problem.
  5. After a couple of minutes (depending on how many websites you’re looking to crawl) it will have spun up each machine for you and they’ll already be crawling the sites.
  6. You can now type in your SSH password (or set one, if this is your first time booting into that instance number) and then connect to them using the VNC Viewer connections we set up previously. You don’t have to rush this last step – the VMs are already crawling the sites for you, so connect whenever you’re ready.
  7. When you’re done with the VMs, click back to the main script window and press Enter – the script will automatically delete each of the automated instances for you (that teardown step is sketched after this list). If you’re at all unsure whether it’s managed to do so, please do check Google’s Cloud Console – a virtual machine left running will cause an unexpected bill.
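
For reference, the teardown the script performs when you press Enter looks roughly like this – instance names and zone as in the earlier sketches. The ‘--quiet’ flag skips gcloud’s confirmation prompt, and the boot disks go too because we created them with auto-delete=yes:

```python
# Teardown sketch: delete the automated instances when the crawls are done.
import subprocess

ZONE = 'europe-west1-b'
NUM_INSTANCES = 3

raw_input('Press Enter to delete the crawl instances...')  # Python 2.7
for i in range(1, NUM_INSTANCES + 1):
    # --quiet suppresses the are-you-sure prompt; the boot disk is
    # removed with the instance because it was created with auto-delete.
    subprocess.check_call([
        'gcloud', 'compute', 'instances', 'delete', 'screamer-%d' % i,
        '--zone', ZONE, '--quiet'])
```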

IMPORTANT NOTE: You need to buy one Screaming Frog license for each machine you intend to run simultaneously. That means, if you intend to crawl 10 websites at the same time, you need to have at least 10 licenses. Screaming Frog is inexpensive – support its continued development.

FAQs

I’m getting an error

Try re-running the main script. If you were part way through running the script when you got the error, running it again will cause errors at every step leading up to where it previously failed. Don’t worry – those errors are just due to those steps having already completed, and the script should continue from where it left off.

How do I get the crawl files off the VMs?

That’s up to you, though I find it quickest to open up a Dropbox file request link in Chrome on the virtual machine and save the files there – that way they go straight into a shared folder in your existing filing (if you already use Dropbox).

Changes I make to automated instances aren’t saved.

Yes, that’s by design. If you want to make changes to an automated instance, you need to change the snapshot itself. An easy way to do this is to make your changes on an automated instance, delete the existing snapshot and then save that instance as the new snapshot.

I get an error when I try to run more than X Virtual Machines

By default, Google limits you to 24 CPUs per region. If you’re bumping into this limit you have three choices:

  1. Decrease the number of CPUs per virtual machine so that your CPU quota stretches to more machines. Note, however, that for some reason network bandwidth is allocated on a per-CPU basis, so the fewer CPUs you provision per VM, the slower it’ll crawl.
  2. Edit the script so that it tracks how many CPUs it’s provisioned and changes zone when you’ve hit your limit (sketched after this list). Note that not all zones provide access to exactly the same (virtual) hardware, and that most are a little more expensive than the zone the script uses by default – which is why it was chosen in the first place.
  3. Request an increased quota for ‘Google Compute Engine API CPUs (all regions)’ on the Quotas page. This isn’t instant, but it also isn’t something you’ll be waiting months to hear back on.
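
Here’s a sketch of option 2 – deciding which zone each new instance should land in once you’ve used up a region’s quota. The 24-CPU figure, the 8 CPUs per VM and the zone list are illustrative; note that zones in the same region share a quota, so the list needs to span different regions:

```python
# Zone rotation sketch: spread instances across regions as each one's
# CPU quota fills up. All figures are illustrative -- check your quotas.
CPUS_PER_VM = 8
CPU_QUOTA_PER_REGION = 24
# Zones in the same region share quota, so pick zones in distinct regions.
ZONES = ['europe-west1-b', 'europe-west2-a', 'us-east1-b']

def zone_for(instance_index):
    """Return the zone the Nth instance (0-based) should be created in."""
    vms_per_region = CPU_QUOTA_PER_REGION // CPUS_PER_VM
    return ZONES[instance_index // vms_per_region]
```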

This sounds really expensive?

As you’re only running the virtual machines when you need them, it actually works out really cheap. Each virtual machine, depending on how you spec it, costs around $0.30 per hour. That means if you’re crawling 10 sites simultaneously and the whole thing takes you 30 minutes, you’ll have only spent $1.50 (plus the cost of 10 Screaming Frog Crawler licenses). As each instance crawls surprisingly fast, you’ll find that entire industries can often be crawled for less than that.
