Running Screaming Frog on Google’s Own Servers
Screaming Frog is the industry-standard SEO crawler, but it can slow down your computer and monopolise your internet connection. Run it instead on Google's Compute Engine and it can go significantly faster, leaving you free to get on with other jobs whilst it runs. Moreover, if you can run one Compute Engine instance, you can run many.
In this part 1 of 2, I'll explain how to set up Screaming Frog to run in the cloud. In part 2, I'll explain how to take it to the next level and use it to crawl potentially hundreds of websites simultaneously, faster than you've ever done before.
Before you start this part 1, you’ll need:
- a Google account
- A card that Google can charge for use of their servers (you'll be able to use them for free for a while, but you have to enter card details immediately in any case)
- a Screaming Frog license. If you're planning to use Screaming Frog on your computer and in the cloud simultaneously, you'll need two licenses. Screaming Frog is inexpensive – support its continued development.
Now, here’s how to get it running:
- Go to the Google Compute Cloud Console and sign up for the free trial.
- Create a new project, calling it whatever you’d like.
- Click on ‘Compute Engine’ ‘VM instances’ and create a new instance. Here are the settings you'll need:
- Once entered, you’ll be back on the virtual machine page and your shiny new virtual machine will begin booting. Click on ‘SSH’ ‘Open in Browser Window’ and you’ll have a box like this appear:
- enter ‘sudo add-apt-repository ppa:webupd8team/java’
- enter ‘sudo apt-get update’
- enter ‘sudo apt-get upgrade’
- enter ‘sudo apt-get install xfce xfce-goodies autocutsel tightvncserver’
- enter ‘sudo apt-get install libxss1 libappindicator1 libindicator7 fonts-liberation’
- enter ‘sudo apt-get install oracle-java8-installer openjdk-8-jre-headless libatk-wrapper-java-jni libgif7 fonts-dejavu-extra java-common gsfonts-x11 oracle-java8-set-default openjdk-8-jre gsfonts libatk-wrapper-java ca-certificates-java libpcsclite1’
- enter ‘touch ~/.Xresources’
- enter ‘wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb’
- enter ‘sudo dpkg -i google-chrome*.deb’
- enter ‘wget https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_9.1_all.deb’
- enter ‘sudo dpkg -i screamingfrogseospider_9.1_all.deb’
- enter ‘vncserver’ and set whatever password you like – I recommend not bothering with a view-only password.
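If you'd rather not paste each command one at a time, the steps above can be collected into a single script. This is a sketch built from the exact commands listed above; it writes the script to a file (the filename `setup-frog.sh` is my own choice) so you can review it in the SSH window before running it with ‘bash setup-frog.sh’:

```shell
# Bundle the setup commands above into one reviewable script.
# Package names and download URLs are taken verbatim from the steps above.
cat > setup-frog.sh <<'EOF'
#!/bin/bash
set -e  # stop on the first error rather than ploughing on

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y upgrade

# Desktop environment, clipboard sync and the VNC server
sudo apt-get install -y xfce xfce-goodies autocutsel tightvncserver

# Libraries that Chrome and Screaming Frog depend on
sudo apt-get install -y libxss1 libappindicator1 libindicator7 fonts-liberation
sudo apt-get install -y oracle-java8-installer openjdk-8-jre-headless \
  libatk-wrapper-java-jni libgif7 fonts-dejavu-extra java-common gsfonts-x11 \
  oracle-java8-set-default openjdk-8-jre gsfonts libatk-wrapper-java \
  ca-certificates-java libpcsclite1

touch ~/.Xresources

# Chrome and Screaming Frog themselves
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
wget https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_9.1_all.deb
sudo dpkg -i screamingfrogseospider_9.1_all.deb
EOF
chmod +x setup-frog.sh
```

You'll still need to run ‘vncserver’ interactively afterwards, since it prompts for a password.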
You now have everything set up in the cloud – you just can't access it yet. To do that, you need to create an SSH connection and install a VNC viewer.
Creating an SSH Connection
We first need to install the Google Cloud SDK.
On Mac OS X
- Open Terminal and enter ‘curl https://sdk.cloud.google.com | bash’
- enter ‘exec -l $SHELL’
- enter ‘gcloud init’
- enter ‘gcloud auth login’
On Windows
- Download and run the Cloud SDK Installer.
- Choose ‘Start Cloud SDK Shell’ and ‘Run gcloud init’
- In the command prompt, run ‘gcloud auth login’
Once the above is complete, enter ‘gcloud compute ssh [the name of your virtual machine] --project [the unique name of your project] --zone us-central1-f --ssh-flag "-L 5901:localhost:5901"’ (use the zone you chose when creating the instance).
The first time you connect, gcloud will generate an SSH key and ask you to set a passphrase, and then the connection will be made.
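To avoid retyping that long command every session, you can keep the connection details in variables – a sketch using hypothetical names (‘my-crawler-vm’ and ‘my-seo-project’ are placeholders; substitute your own instance and project):

```shell
# Hypothetical instance and project names -- substitute your own.
INSTANCE="my-crawler-vm"
PROJECT="my-seo-project"
ZONE="us-central1-f"   # must match the zone you created the instance in

# Build the tunnel command: it forwards local port 5901 to the VM's VNC server.
CMD="gcloud compute ssh ${INSTANCE} --project ${PROJECT} --zone ${ZONE} --ssh-flag \"-L 5901:localhost:5901\""
echo "${CMD}"
# Run it with: eval "${CMD}"
```

The ‘-L 5901:localhost:5901’ flag is what lets your local VNC viewer reach the server running on the VM.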
Installing & Setting up VNCViewer
- I like RealVNC's VNC Viewer, though it doesn't really matter which one you use. You can download it here.
- Once installed/opened, click on ‘File’ ‘New’ and enter ‘127.0.0.1:5901’ as the VNC Server – name it whatever you like.
- Once saved, click on the server’s icon and type in the password you set when you typed ‘vncserver’ earlier. A window should then open connecting you to your virtual machine:
And there you have it – you're all set to run your first crawl. One important thing to remember: you're paying (or using your free trial credits) whilst the machine is running (and only a tiny amount when it's not), so when you're done with the VM, close the terminal window and the VNC viewer, and stop the machine in the Google Cloud Console.
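Stopping the instance can also be done from your own terminal rather than the console – again a sketch with placeholder names, using the standard ‘gcloud compute instances stop’ command:

```shell
# Placeholder names -- substitute your own instance, project and zone.
INSTANCE="my-crawler-vm"
PROJECT="my-seo-project"
ZONE="us-central1-f"

# Stop (not delete) the VM so you stop paying the full running rate; the disk
# persists, so a later 'gcloud compute instances start' picks up where you left off.
echo gcloud compute instances stop "${INSTANCE}" --project "${PROJECT}" --zone "${ZONE}"
# Remove the leading 'echo' to actually run it.
```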
You're ready to crawl hundreds of thousands of URLs without worrying about slowing down your internet connection or computer. BUT I wouldn't recommend stopping here. Ultimately, the above is a lot of work if all you're trying to do is stop colleagues complaining about you bogging down the connection. The real power doesn't come from running a single instance – it comes from automating the whole thing and running multiple instances at the same time, giving you a huge speed boost that'll let you crawl entire industries faster than you've previously crawled single sites. Luckily, that's a lot easier than it sounds. So jump over to part 2 here to turn this basic setup into something that'll change the way you work.