
How to Crawl An Entire Industry Faster Than Most Crawl A Single Site

We’ve previously talked about running Screaming Frog’s Crawler on Google’s Compute Cloud. Now, I want to share how we took this to the next level – how you can use this to automatically crawl as many websites as you want simultaneously.

Running Screaming Frog In the Cloud

As a quick reminder, the advantages of running Screaming Frog in the cloud include:

  1. As it’s not running locally, it doesn’t slow your computer down so you’re free to get on with other work whilst it’s running.
  2. For the same reason, it doesn’t slow your internet connection down. Something, I’m sure, your colleagues will thank you for.
  3. It’s fast. As it uses the internet connection attached to Google’s Compute Cloud it can potentially crawl much faster than your office internet connection. We’ve seen speeds of over 1.5k URLs per second.
  4. You can run it on a computer with (practically) any amount of RAM, hard-disk space and processors you want allowing you to scale to crawl the largest websites.
  5. It’s cheap; a single Virtual Machine with 30GB of RAM will cost you less than $0.30 an hour to run and you only run it when you need it.

Why Command A Screaming Army?

Technical SEO has become increasingly complex. As a result, we rely on our technical SEOs to have an increasingly deep understanding of the subject to be able to find insights. Perhaps more importantly, though, these need to be communicated to people who are often not SEO-savvy in a way that’s compelling; that’ll inspire action. What almost all businesses care about is their competition. As a result, it’s very common, when discussing links, to compare companies to their competition.

We do this to give meaning to numbers:

DTOX Risk Comparison

Telling someone that their website has a DTOXRisk of 156 is likely meaningless; telling them that their link portfolio has 'less than half the industry-average risk of a penalty' is immediately accessible.

We provide industry comparisons to show the depth of the problem or opportunity:

anchor-text comparisons

Here, instead of saying that 15% of your anchor text uses commercial terms, we might comment that the analysed site has 4x the commercial anchor text of the industry average – that they'd need to increase the size of their link portfolio by 11% with purely branded anchor text just to get back to the industry average.

As almost every company has that one competitor that it really really hates, when presenting to the C-suite, we find that comparing directly with that one competitor can often yield the fastest results:

link-competitor-comparison

Something strange happens when we start to discuss technical SEO, though. We start showing relatively complex formulas to explain how fixing a canonical issue, for example, might influence total revenue generated. We ditch the competitor comparisons and don’t show graphs like these:

technical-seo-comparison

If we’re honest, the reason we don’t create graphs like these is not that they’re ineffective, but because you’d have to crawl an entire industry. Crawling an entire industry would either take a prohibitive amount of time, if using Screaming Frog, or be expensive if using a SaaS crawler like Deepcrawl.

What if, instead, you could run multiple machines at the same time, each running its own copy of Screaming Frog? That way, you could simultaneously crawl every site in an industry. Typically, this isn’t an option because they’d be fighting with each other for network bandwidth and putting in an order to your boss for 10 laptops is unlikely to get the green light. If you use Google Compute Cloud, though, it suddenly becomes possible.

In the next section, I'm going to explain how to set up a system in which you feed in a series of websites, run some scripts and, in the background, it initiates multiple virtual machines, each running Screaming Frog Crawler and each allocated a website from the list you gave it to start crawling. This makes it possible to crawl an entire industry faster than you could typically crawl one website using just the computer you're reading this article on.

Side-note: did you know the collective noun for a group of frogs is an army of frogs? I didn’t…

Pre-requisites: You'll need three things before you start:

  1. a Google account;
  2. a debit or credit card that Google can charge for use of their servers (you'll be able to use them for free for a while, but have to enter card details immediately in any case); and
  3. one Screaming Frog license for each machine you intend to run simultaneously. That means, if you intend to crawl 10 websites at the same time, you need to have at least 10 licenses. Screaming Frog is inexpensive – support its continued development.

Step 1: Creating A Virtual Machine

You start by creating a single virtual machine on Google Compute Cloud that’s running everything you’ll need to crawl a single site. I’ve written up how to do this here so pop over to that article, get set up and then return here for the next step.

NOTE: Whilst you're getting set up, note down your username shown in the terminal (the name before the @ on each line) – you'll need it later on and now is the most convenient time to jot it down.

Step 2: Connecting Your VM to the Outside World

Welcome back. You should now have a virtual machine, running Screaming Frog Crawler and Chrome. Now, we need to create a way to automagically control that Virtual Machine. Luckily, it’s pretty simple:

  1. VNC into the Virtual Machine
  2. Open Google Chrome (it’ll be under ‘Applications’ ‘Internet’ and will have been installed via one of the scripts you ran previously).
  3. Load up this post in Google Chrome and download screaming-frog-auto-start.sh by clicking here. Save it to your virtual machine’s desktop.
  4. Open ‘Applications’ ‘Settings’ ‘Session and Startup’ and click on the ‘Application Autostart’ tab.
  5. Click ‘Add’, then the folder icon, choosing the ‘/’ in the left-hand box
  6. Browse to ‘usr’, then ‘bin’ and select ‘google-chrome’ and press OK
  7. Name it ‘Chrome’ and then click OK. (you’ve just set Google Chrome to auto-start as you’ll almost certainly open it up every time to save the output of Screaming Frog in any case)
  8. Click ‘Add’ again, then the folder icon, choosing ‘Desktop’ this time and selecting the script you previously downloaded.
  9. Click OK, name it anything you like (I went with ‘screaming-start’) and click OK again.
  10. Then click ‘Close’ and you’re done.

With these steps, you've set Linux to boot Chrome on startup and to run a script that pulls the metadata set for that machine (we'll be setting that in another script), reads the URL of the site to crawl from it and then starts Screaming Frog with an instruction to crawl that site.
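To make that concrete, here's a minimal sketch of what the startup step boils down to, written in Python purely for illustration – the screaming-frog-auto-start.sh you just downloaded already does this for you, so treat that script as the source of truth. The metadata key name ('crawl-url') and the way Screaming Frog is launched below are assumptions.

    # Illustration only: fetch the crawl target from the Compute Engine
    # metadata server, then hand it to Screaming Frog.
    import subprocess
    import urllib2  # Python 2.7, to match the rest of this guide

    # Assumed metadata key; the real script defines the key it actually reads
    METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                    "instance/attributes/crawl-url")

    # The metadata server requires this header on every request
    request = urllib2.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    site_to_crawl = urllib2.urlopen(request).read().strip()

    # Launch Screaming Frog against that site; exact launch flags depend on your version
    subprocess.Popen(["screamingfrogseospider", site_to_crawl])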

Step 3: Set Up Screaming Frog

Currently, Screaming Frog has all the default options and doesn't even have a license key entered. You'll find Screaming Frog under 'Applications' 'Internet'. Load it up, enter a license key, and set the settings up how you like them. As the internet connection is so good – and you don't have to worry about slowing it down for your colleagues – I typically set it to crawl with a maximum of 100 threads. Be wary of the type of sites you're crawling, though, as this would be enough to take down many smaller sites, which is not what you're trying to achieve! When you have the settings how you like them, close Screaming Frog and close the Virtual Machine window.

Pop into Google Cloud Console and stop the instance, so you’re not charged for it doing anything else.

Step 4: Set Up the Virtual Machine as a Snapshot

Your virtual machine is all set up, but now we have to make it easily reproducible. We do this by creating a Snapshot. Snapshots are also compressed and so are cheaper to store than Virtual Machines themselves. Here's how:

  1. Log in to Google Compute Cloud Console and, from the left-hand menu, select ‘Snapshots’.
  2. Click ‘Create Snapshot’
  3. Name it ‘screaming-snapshot’ and then select whatever you called the virtual machine you’ve been working from thus far from the ‘Source disk’ menu.
  4. Click ‘Create’
  5. You can now click back into ‘VM Instances’ and delete your virtual machine – you’ve backed it up in the previous step.

Step 5: Setting Up Python

The script that automates everything for you is written in a programming language called Python. It's great; I'm terrible at it. Feel free to look at my rag-tag pieces of code if you would like a comprehensive lesson on how not to implement best practice and, generally, want to amuse yourself.

If you’ve not used Python before on the computer you’re on then follow these guides to get set up:

NOTE: The guides above will take you through installing Python 2.7 rather than the latest version, Python 3. For historical reasons we use Python 2.7 – I'm sure with a few changes you could probably get the script working in Python 3 too. If you don't know what the difference between 2.7 and 3 is then please ignore this segue entirely.

Step 6: Download and Edit the Scripts

You now have a virtual machine template that, when booted, will open VNC Server on a port you tell it, open Screaming Frog and begin crawling a site of your choice. Now, we need to create a script that automatically creates copies of that virtual machine template, boots them and provides them with the right details to get on with their work.

  1. Create a folder on your computer where the scripts can live
  2. Create a text file called 'sites-to-crawl.txt'. The text file should contain the absolute URLs of the sites you want to crawl at that time, with each site on a new line.
  3. Next, we’ll be saving and editing the Python code that pulls everything together. Download our template files here, and here, saving them in the same directory you saved the sites-to-crawl.txt file.
  4. Once downloaded, open the files in your favourite editor (I like Sublime, our devs like Visual Studio –  though you could just use TextEdit if you don’t want to waste time installing another thing).
  5. Search the file for [[[ ]]] sections. These should be replaced (including the brackets) with inputs that are specific to your setup. We’ve explained within the brackets of each what’s required.
  6. Now download terminal.py from here (more info on this awesomely useful script here) and save it in the same directory.
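Under the hood, the idea is straightforward: for each line in sites-to-crawl.txt, create a boot disk from your 'screaming-snapshot', start an instance on that disk and pass the site through as instance metadata. Here's a rough sketch of that concept using the gcloud command line from Python – it's an illustration rather than a copy of scream.py, and the instance names, machine type, zone and metadata key are all assumptions you'd match to your own setup.

    # Conceptual sketch only -- scream.py (downloaded above) is the real script.
    import subprocess

    ZONE = "us-central1-f"  # assumption: whichever zone you built the original VM in

    def launch_crawler(index, url):
        name = "screaming-vm-%d" % index  # hypothetical naming scheme
        # 1. Create a boot disk from the snapshot saved in Step 4
        subprocess.check_call([
            "gcloud", "compute", "disks", "create", name,
            "--source-snapshot", "screaming-snapshot", "--zone", ZONE])
        # 2. Boot an instance on that disk, handing it the site to crawl as metadata
        subprocess.check_call([
            "gcloud", "compute", "instances", "create", name,
            "--disk", "name=%s,boot=yes,auto-delete=yes" % name,
            "--machine-type", "n1-standard-8",
            "--metadata", "crawl-url=%s" % url,
            "--zone", ZONE])

    with open("sites-to-crawl.txt") as sites:
        urls = [line.strip() for line in sites if line.strip()]
    for i, site in enumerate(urls, start=1):
        launch_crawler(i, site)

The real script also tears the instances back down when you press Enter at the end of Step 8, so you don't need to clean these up by hand.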

Step 7: Setting Up VNC Viewer

In the guide to setting up your original virtual machine, you will have set up VNC Viewer to connect to one virtual machine at 127.0.0.1:5901. You now need to add connections for as many virtual machines as you think you might create.

  1. Open VNC Viewer
  2. Right-click on your existing VNC connection and choose ‘Duplicate’.
  3. Repeat the step above until you have as many connections as the number of sites you think you might want to crawl simultaneously.
  4. Now right click on each new connection that you created and choose ‘Properties’.
  5. Change the port number (the part of the text box next to 'VNC Server' that has a four-digit number) to match the order of the virtual machine it is (e.g. your original connection should be 5901, your next 5902, the next 5903, etc.).
  6. Change the name given to that connection to whatever you want – I'm boring and so use '1', '2', '3' etc, but feel free to name them after dinosaurs, power rangers or wizarding houses if you really want to. I won't judge you. Well, maybe a little.

Step 8: In Which the Magic Happens

You’re now all set to go!

  1. List all the sites you want to crawl in sites-to-crawl.txt, saving and closing the file.
  2. Open up a Terminal (or Command Prompt) window and change to the folder you have the Python code in (using 'cd' and then the name of the next layer of directory).
  3. When in the correct folder, type ‘python scream.py sites-to-crawl.txt’
  4. The code will now begin to run – don’t worry about any errors about hard-drives being small as, in practice, it’s not a problem.
  5. After a couple of minutes (depending on how many websites you’re looking to crawl) it will have spun up each machine for you and they’ll already be crawling the sites.
  6. You can now type in your SSH password (or set one if this is your first time booting into that instance number) and then connect to them using the VNC Viewer connections we set up previously. You don't have to rush this last step – the VMs are already crawling the sites for you, so connect whenever you're ready.
  7. When you’re done with the VMs, click back to the main script window and press Enter – the script will automatically delete each of the automated instances for you. If you’re at all unsure if it’s managed to do so please do check Google’s cloud compute console – a virtual machine left running will cause an unexpected bill.

IMPORTANT NOTE: You need to buy one Screaming Frog license for each machine you intend to run simultaneously. That means, if you intend to crawl 10 websites at the same time, you need to have at least 10 licenses. Screaming Frog is inexpensive – support its continued development.

FAQs

I’m getting an error

Try re-running the main script. If you were part way through running the script when you got the error, running it again will throw errors at every step leading up to where it previously failed – don't worry, those errors are just because those steps have already completed – and it should then continue from where it left off.

How do I get the crawl files off the VMs?

That’s up to you, though I find it quickest to open up a Dropbox file request link in Chrome on the virtual machine and then save the files there – that way they quickly go into a shared folder in your existing filing (if you already use Dropbox).

Changes I make to automated instances aren’t saved.

Yes, that's by design. If you want to make changes to an automated instance you need to change the snapshot itself. An easy way to do this is to make your changes on one of the automated instances, delete the existing snapshot and then save that instance as the new snapshot.

I get an error when I try to run more than X Virtual Machines

By default, Google limits you to 24 CPUs per region. If you’re bumping into this limit you have three choices:

  1. Decrease the number of CPUs per virtual machine so that your CPU quota stretches to more machines. Note, however, that for some reason network bandwidth is allocated on a per-CPU basis so the fewer CPUs you provision per VM the slower it’ll crawl.
  2. Edit the script so that it tracks how many CPUs it's provisioned and switches region when you've hit your limit. Note that not all regions provide access to exactly the same (virtual) hardware and that most are a little more expensive than the one chosen, which is why it was chosen in the first place.
  3. Request an increased quota for ‘Google Compute Engine API CPUs (all regions)’ on the Quotas page. This isn’t instant, but it also isn’t something you’ll be waiting months to hear back on.

This sounds really expensive?

As you’re only running the virtual machines when you need them, it actually works out really cheap. Each virtual machine, depending on how you spec it, costs around $0.30 per hour. That means if you’re crawling 10 sites simultaneously and the whole thing takes you 30 minutes, you’ll have only spent $1.50 (plus the cost of 10 Screaming Frog Crawler licenses). As each instance crawls surprisingly fast, you’ll find that entire industries can often be crawled for less than that.

Fixing Google’s Keyword Search Volume Aggregation

If you ask Google Keyword Planner the search volume for 'cheap windows laptop' in the US, it'll tell you that it's 100-1k – thanks for the help, Google! If, instead, you turn to the tool providers, you'll get answers somewhere in the range of 1.5k (Searchmetrics) to 2.9k (SEOMonitor). Yet what happens when you ask about variations of that keyword? You'll get this result:

keyword-variations

Clearly, they don’t actually each have the same search volume.

What’s going on?

Since 2012, Google has included, within exact match search volumes, the search volume of misspelt and pluralised close variants of the keywords entered. In March 2017, this was expanded to include alternate orderings of those same keywords (e.g. 'cheap windows laptops' and 'windows laptop cheap' appearing to have the same volume). Google also ignores stop-words (words like 'as', 'in' and 'of') and understands abbreviations (that 'lol' is the same as 'laugh out loud', for example).

If you’re conducting keyword research and are putting together a list of 50 keywords this is pretty easy to solve by spotting and removing the duplication. When you’re working on a list of tens or even hundreds of thousands of keywords, though, this is practically impossible to do manually.

That means that those keywords could appear in your list with each one showing a search volume of 2.9k – when you add up the total addressable audience you end up with a figure in excess of 11k. Any forecast based on that data will skew too high, making what would otherwise be a reasonable forecast potentially unreachable. In tests, we've found this to affect anywhere between 0.5% and 10% of search volume, depending on where the original keyword list comes from. 10% is the difference between confidently beating target and ending up below target.

Canonical Keywords

We fixed this through the concept of a 'canonical keyword'. This is the simplest form that a keyword could take, with all the words in alphabetical order. That means no pluralisation, no conjugation, no misspellings and no pesky word order differences.

It turns out, this sounds a lot easier to implement than it is.

Removing pluralisation is hard because it’s not always a case of removing the ending ‘s’ – see, for example, woman/women, genius/geniuses and tooth/teeth.

There's no 'fix all' button in Excel for spelling mistakes and, whilst VBA scripts exist to reorder the words in a cell alphabetically, those scripts are unwieldy and, frankly, at that stage you should be in Python or R in any case.
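To illustrate the idea – this is not the Keyword Cleaner itself, just a minimal sketch that assumes an English keyword list and the NLTK library, and skips the spelling-correction part entirely – canonicalising a keyword can be as simple as lowercasing it, singularising each word and sorting the words alphabetically:

    # Minimal illustration of a 'canonical keyword': lowercase, singularise,
    # sort the words alphabetically. Requires: pip install nltk, plus the
    # 'wordnet' corpus via nltk.download('wordnet').
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def canonical_keyword(keyword):
        words = keyword.lower().split()
        singulars = [lemmatizer.lemmatize(word, pos="n") for word in words]
        return " ".join(sorted(singulars))

    print(canonical_keyword("cheap windows laptops"))  # cheap laptop window
    print(canonical_keyword("windows laptop cheap"))   # cheap laptop window

Both variants collapse to the same canonical form, which is exactly what lets you spot the duplicated search volume.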

The Keyword Cleaner

As a result, we built the Keyword Cleaner, which is available for free here.


Simply enter your keyword list and then click 'clean'. After a moment (it processes roughly 3k keywords a minute, depending on how many people are using it) it'll give you the canonical version of each keyword, ready for you to export.

Next, take those values and add them into a column next to your original keywords in Excel. You'll then want to count how many times that canonical keyword appears in your keyword list where the search volume and landing pages also match (this stops you decreasing the search volume in cases of a false match). The formula will depend on how you've set up your table, though it should look roughly like this:

=COUNTIFS([Canonical],[@Canonical],[Search Volume],[@[Search Volume]],[URL],[@URL])

Next, you can simply divide the search volume for each keyword by the number of occurrences of that canonical keyword in the list (as computed above).
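If your list is big enough that Excel starts to groan, the same de-duplication is only a few lines of pandas. This is a rough sketch – the file name and column headings are assumptions, so match them to however you've laid out your own data:

    # Divide each keyword's volume by the number of rows sharing the same
    # canonical keyword, search volume and URL, so the total isn't inflated.
    import pandas as pd

    df = pd.read_csv("keywords.csv")  # assumed columns: Keyword, Canonical, Search Volume, URL

    matches = df.groupby(["Canonical", "Search Volume", "URL"])["Keyword"].transform("count")
    df["Deduped Volume"] = df["Search Volume"] / matches

    df.to_csv("keywords-deduped.csv", index=False)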

Now, obviously, that search volume won't be accurate on a per-keyword basis – we know, for example, that misspellings get roughly 10% of the search volume of the correctly spelt variant. There are two things to remember, though: 1) it's still more accurate than the aggregated volume and 2) this is about getting an accurate forecast based on all the keywords and an accurate total search volume – and this solution fixes that.

In a future version, we’ll likely identify which canonical keywords were fixed misspellings so that you can reduce search volumes accordingly, but that’s for another time and another blog post. Have a play with the tool and leave some feedback below. We’d love to hear your thoughts.

Running Screaming Frog on Google’s Own Servers

Screaming Frog is the industry standard SEO crawler, but it can slow down your computer and monopolise your internet connection. By running it, instead, on Google’s Compute Cloud it can run significantly faster and you’re free to get on with other jobs whilst it runs. Moreover, if you can run one Google Compute Cloud Instance, you can run many.

In this part 1 of 2, I'll explain how to set up Screaming Frog to run in the cloud and then, in part 2, I'll explain how to take it to the next level and use it to crawl potentially hundreds of websites, simultaneously, faster than you've ever done before.

Before you start this part 1, you’ll need:

  1. a Google account;
  2. a card that Google can charge for use of their servers (you’ll be able to use them for free for a while, but have to enter card details immediately in any case); and
  3. a Screaming Frog license. If you’re planning to simultaneously use Screaming Frog on your computer and the cloud you’ll need two licenses. Screaming Frog is inexpensive – support its continued development.

Now, here’s how to get it running:

  1. Go to the Google Compute Cloud Console and sign up for the free trial.
  2. Create a new project, calling it whatever you’d like.
  3. Click on 'Compute Engine' 'VM instances' and create a new instance. Here are the settings you'll need:
  4. Once entered, you'll be back on the virtual machine page and your shiny new virtual machine will begin booting. Click on 'SSH' 'Open in Browser Window' and you'll have a box like this appear:
  5. enter ‘sudo add-apt-repository ppa:webupd8team/java’
  6. enter ‘sudo apt-get update’
  7. enter ‘sudo apt-get upgrade’
  8. enter ‘sudo apt-get install xfce xfce-goodies autocutsel tightvncserver’
  9. enter ‘sudo apt-get install libxss1 libappindicator1 libindicator7 fonts-liberation’
  10. enter ‘sudo apt-get install cabextract enchant fonts-wqy-zenhei gstreamer1.0-plugins-good gstreamer1.0-x hunspell-en-us libaa1 libavc1394-0 libcaca0 libdv4 libenchant1c2a libgstreamer-plugins-good1.0-0 libharfbuzz-icu0 libhunspell-1.6-0 libhyphen0 libiec61883-0 libjavascriptcoregtk-4.0-18 libraw1394-11 libshout3 libspeex1 libv4l-0 libv4lconvert0 libvpx4 libwavpack1 libwebkit2gtk-4.0.37 libwebp6 ttf-mscorefonts-installer zenity zenity-common’
  11. enter ‘sudo apt-get install oracle-java8-installer openjdk-8-jre-headless libatk-wrapper-java-jni libgif7 fonts-dejavu-extra java-common gsfonts-x11 oracle-java8-set-default openjdk-8-jre gsfonts libatk-wrapper-java ca-certificates-java libpcsclite1’
  12. enter ‘touch ~/.Xresources’
  13. enter ‘wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb’
  14. enter ‘sudo dpkg -i google-chrome*.deb’
  15. enter ‘wget https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_9.1_all.deb’
  16. enter ‘sudo dpkg -i screamingfrogseospider_9.1_all.deb’
  17. enter ‘vncserver’ and set whatever password you like – I recommend not bothering with a view-only password.

You now have everything set up in the cloud – you just can't access it. To do that you need to create an SSH connection and install a VNC viewer.

Creating an SSH Connection

We first need to install GCloud SDK.

On Mac OS X

  1. Open Terminal and enter ‘curl https://sdk.cloud.google.com | bash’
  2. enter ‘exec -l $SHELL’
  3. enter ‘gcloud init’
  4. enter 'gcloud auth login'

On Windows

  1. Download and run the Cloud SDK Installer.
  2. Choose ‘start Cloud SDK Shell’ and ‘Run gcloud init’
  3. In command prompt, run 'gcloud auth login'

Once the above is complete, enter 'gcloud compute ssh [the name of your virtual machine] --project [the unique name of your project] --zone us-central1-f --ssh-flag "-L 5901:localhost:5901"'

It will ask you to set a password and then the connection will be made.

Installing & Setting up VNCViewer

  1. I like RealVNC’s VNC Viewer, though it doesn’t really matter which you use. You can download it here.
  2. Once installed/opened, click on 'File' 'New' and enter '127.0.0.1:5901' as the VNC Server – name it whatever you like.
  3. Once saved, click on the server's icon and type in the password you set when you typed 'vncserver' earlier. A window should then open connecting you to your virtual machine:

And there you have it – you're all set to run your first crawl. One important thing to remember is that you're paying (or using your free trial credits) whilst the computer is running (and only a tiny amount when it's not), so, when you're done with the VM, shut down the terminal window and VNC Viewer and, in the Google Cloud Compute Console, stop the machine.

You’re ready to crawl hundreds of thousands of URLs without having to worry about it slowing down your internet connection or computer. BUT I wouldn’t recommend stopping here. Ultimately the above is a lot of work if all you’re trying to do is stop colleagues complaining about you bogging down the connection. The real power of this doesn’t come from running a single one – it happens when you automate the whole thing, running multiple instances at the same time, giving you a huge speed boost that’ll allow you to crawl entire industries faster than you’ve previously crawled single sites. Luckily, that’s a lot easier than it sounds. So jump over to part 2 here to turn your basic setup into something that’ll change the way you work.

The rarest coins since 2008!

Dig into your pockets, ladies and gents, because the loose change jingling around in there could be worth a mighty mint! To be clear, we're not just talking 100-year-old coins here – even coins from the last ten years could make you that little bit richer.

Here’s a breakdown of the 5 rarest coins since 2008 that could be hiding down the back of the sofa or saved in your penny jar, so get looking:

The undated 20p from 2008

Used to be worth: 20p.    Now: valued at £100

This error coin has gone up by some 40,000% and is perhaps the rarest coin jingling around in your holiday fund jars. Caused by wrongly paired obverse and reverse dies, 200,000 of these 'mule' 20p coins were circulated.

The 2009 Kew Gardens 50p

Used to be worth: 50p.     Now: £30

This really is one of the rarest, with only 210,000 in circulation. The design on it (the Chinese Pagoda) goes back as far as 1761, and it's a real precious find if you happen across one.

The (Coloured!) Peter Rabbit 50p Coin from 2016

Used to be Worth: 50p.     Now: up to £800

Averaging between £500 and £800, these lovely coins are a coin collector's dream. They sold out (unsurprisingly, given their beautiful colouration and the Beatrix Potter design), and specialist sellers can – and do – sell them for well over the asking price, somewhere in the £600 region.

The ‘Abolition of the Slave Trade’ Elizabeth II 2007 £2 coin

Used to be worth: £2        Now: £1,100

This £2 coin's error lies in the wording on it: instead of being inscribed with "AM I NOT A MAN AND BROTHER", it actually says "UNITED INTO ONE KINGDOM", which was supposed to be the inscription used for the Act of Union anniversary coin.

The Olympic Swimming 50p Coin from 2012

Used to be worth: 50p.      Now: approx. £1,000

A total of 29 coins were released in 2012 to commemorate the London Olympics, with designs incorporating 29 different sports; an error occurred with only one, the Aquatics 50p coin. Originally, the coin showed water covering the swimmer's face; however, it was decided that the swimmer's face should show better in later coins. It is unknown quite how many with the original design are in circulation, and they're harder to spot because of the subtle difference, so keep your eyes sharp and do that double take – it will be well rewarded at approx. £1,000!

How We Got Guardians of the Galaxy Director, James Gunn, To Comment on Our Campaign

When James Gunn, Director of Guardians of the Galaxy, spent two hours on Twitter discussing one of our recent creative pieces, we knew it had gone viral. Following the success of Directors Cut, we’d like to share a bit about how we got there.

Directors-Cut-Campaign-for-Gocompare.com

Made for our client, Gocompare.com, the piece racked up some nice numbers to go along with James Gunn’s comments on Twitter and Facebook:

  • Over 48k shares in 48 hours
  • More than 437 pieces of coverage, 90% linking directly to the piece (of which, less than 5% were no-followed) and about 40% directly linking to Gocompare.com’s life insurance product page
  • Coverage on big name sites like The Independent, NME, IGN, Entertainment Weekly and The Guardian
  • Viewing it through PR metrics – coverage on sites with a combined reach of nearly 400 million people, producing an AVE of £3.8 million.

All of the above happened within the first week of the campaign going live.

The Concept

Life insurance is a tricky subject to create content around due to the relatively serious nature of the product itself. We’re always looking to create content that covers multiple interest areas so that we can outreach in multiple verticals, so talking about fictional deaths in movies seemed to solve two problems at the same time.

We quickly came across a fan-made piece that had rounded up on-screen deaths and knew immediately that we were on to something:

Despite being a simple (maybe even ugly) graph, the above content managed to achieve some organic coverage on Business Insider, where it was viewed over 42k times. When this came out in early 2014 it clearly resonated with its audience despite the deadliest film being, unsurprisingly, Lord of the Rings: Return of the King, which is pretty much 3 hours of war.

We realised that by refreshing the data we would likely end up with a whole new list (there were some films since 2014 that we were just itching to add). By playing with the data we knew we could pull out new angles and by working on the design we could make it more accessible.

The Data

As much as we wanted to, binge-watching movies wasn't going to be an economical way to get the information. The original piece had pulled the information from a forum, so our original plan was to do the same. We updated the information and then ran into one of the biggest stumbling blocks of the piece – the data had hardly changed. At some point in late 2013, the forum had lost popularity and hardly any new films had been added since.

At that point we had a choice – ditch the piece, go ahead with the information we had or find another solution. As professional maze walkers, we naturally chose the last option. Through relatively extensive searching we found another option – a mixture of YouTube videos and forums that had newer counts, using the same methodology. By merging this data together, and adding in a serious chunk of validation, we now had a fresh perspective and interesting data we could use. The standout story to emerge was Guardians of the Galaxy in first position. Who would have thought that a film that features virtually no bloodshed – a Disney film – would be the deadliest? Not only do comic book/Marvel movies practically have their own sub-genre (and separate list of fan sites) online, but Guardians of the Galaxy 2 is currently in production, making the campaign timely.

The Execution

At its heart the piece is a listicle, but if that's all it became, the best execution might just have been a press release. What value could we add? We decided that, by adding information about each film in our top ten, as well as imagery, we could do all the research for journalists, making the piece incredibly easy to write up.

Meanwhile, we’d finished playing with the data, looking for secondary angles. We found interesting stories around years (“are films getting deadlier”), genre (“are horror films really the most deadly”) and ratings (“deadliest films aren’t necessarily R-rated”). We doubled-down on the execution to add a way to explore these secondary angles. These hooks were implicit so that we could take different angles to different journalists.

The outcome was:

  • Simple – Guardians of the Galaxy Deadliest Film Ever
  • Unexpected – it’s a family movie
  • Concrete – We tell you exactly how many deaths and death, in itself, is a pretty concrete concept; and
  • Credible – thanks to detailed supporting information we made available to journalists.

We knew that readers were likely to only read the headline, which meant that it was inevitable fiery debate would occur as to why the most deadly film wasn’t, instead, something like Star Wars where whole planets blew up. This emotional element would help to propel the campaign through social media, making it highly shareable.

Outreach

We don't waste our time with bloggers and mid-tier publications. We believe that taking our campaigns to the largest sites in the world gives them access to huge audiences, which in turn leads to secondary pick-up and huge results. This was no exception. What better place for a piece that talked about Guardians of the Galaxy than The Guardian? It'd give the piece a credibility boost and access to a huge audience, and is a great link to have in itself. So it was the first site we outreached to, and the first coverage to go live:

Guardian coverage of Directors Cut

Here’s a little secret – it also went live on NME and the Independent almost simultaneously. By making sure it appeared on multiple large sites at the same time we were hoping to begin an avalanche of coverage.

That’s when this happened:

James Gunn comments on Directors Cut

James would spend over two hours on Twitter, sending over 50 tweets, talking about our campaign. It suited his agenda for the story to blow up as much as possible because he was busy feeding the publicity machine around the new movie. Meanwhile, the story was breaking elsewhere, gaining over 900 upvotes on Reddit and the social pressure from the coverage was creating a Twitter Moment:

Directors Cut Became a Twitter Moment

Our outreach team quickly picked up James Gunn’s tweet and added it into their outreach emails – providing credibility to the piece and a further angle. In fact, James’ tweets were so noteworthy, many of the publications that had already covered the piece went back and covered it a second time. So far, within two weeks of launching, it has appeared on Yahoo a total of four times.

Soon, the piece had been covered on over 400 sites including:

Coverage Highlights

Our Outreach team kept ahead of the coverage and, because we speak 12 different languages, we were able to take it to publications all around the world, starting new fires of publicity across the globe (including in France, Finland, Russia and Brazil).

Overall, we smashed all targets set for this piece by starting with a simple concept, producing new and surprising data, executing like a boss and staying nimble with outreach, getting coverage from some of the best sites in the world.