Comparing Screaming Frog Crawl Files

Handing over technical recommendations often comes with some trepidation; how long might it take for them to be implemented, will they be implemented at all and, if so, will they be implemented correctly? That’s why understanding how development cycles occur, how items are prioritised and who you need to get onside is as key to successful technical SEO as the recommendations themselves. However well you understand those, though, changes are often implemented without any feedback that they’re now complete.

It’s for that reason that tools like ContentKing have sprung up: to keep an eye on the site and alert you to changes. It’s not always feasible to run SaaS crawlers on the site, though. As a result, many of us rely on running crawls with Screaming Frog’s crawler. Comparing crawl files can be a pain. Usually, you’ll end up dumping the data into Excel and running a bunch of VLOOKUP or INDEX/MATCH functions, only to find that no, the developer hasn’t implemented the changes.

Meanwhile, you’ll occasionally want to compare crawl files of different sites to:

Compare a dev environment with a staging environment
Make sure content has been ported to a new site correctly
Run technical SEO competitive analysis/comparisons – we wrote about this recently here.

This has always been a pain, which is why, for a while now, we’ve had a tool that quickly compares crawl_overview files for us. Today, we’re making it available for free.

It’s a simple Python script. If you don’t have Python installed, you can read a guide for Windows here and for macOS here (you’ll need Python 2, rather than 3, for the script to work – though feel free to install both using virtual environments if you’re really keen on 3). The script itself is here:

import csv
import sys

from tqdm import tqdm


class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


def main(argv):
	if len(argv) != 4:
		print 'Usage: programname.py crawl_overview1.csv crawl_overview2.csv output.csv'
		sys.exit()

	headerrows = 5
	endline = 191

	fileone = get_csv(argv[1])
	filetwo = get_csv(argv[2])

	fileone = fileone[0:endline]
	filetwo = filetwo[0:endline]

	fileonesite = fileone[1][1]
	filetwosite = filetwo[1][1]

	fileone = fileone[headerrows:]
	filetwo = filetwo[headerrows:]

	fileonedata = []
	filetwodata = []
	combineddata = []
	firstcolumn = []

	firstcolumn.extend(get_column(fileone,0))
	fileonedata.extend(get_column(fileone,1))
	filetwodata.extend(get_column(filetwo,1))
	combineddata.extend(zip(firstcolumn,fileonedata,filetwodata))


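	# Binary mode ('wb') is what Python 2's csv module expects; it avoids blank rows on Windows
	with open(argv[3], 'wb') as outhandle:
		outFile = csv.writer(outhandle)
		outFile.writerow(["",fileonesite,filetwosite])
		for i in tqdm(combineddata):
			outFile.writerow(i)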

	if fileonedata == filetwodata:
		print (color.BOLD + color.RED + "Crawl files are identical" + color.END)
	else:
		print (color.BOLD + color.GREEN + "Crawl files are NOT identical" + color.END)

def get_csv(thefile):
	datafile = open(thefile, 'r')
	datareader = csv.reader(datafile, delimiter=",")
	data=[]
	for row in tqdm(datareader):
		data.append(row)
	datafile.close()
	return data

def get_column(thelist,thecolumn):
	newlist =[]
	for row in tqdm(thelist):
		if len(row) >= thecolumn +1:
			newlist.append(row[thecolumn])
		else:
			newlist.append("")
	return newlist

if __name__ == '__main__':
  main(sys.argv)

The only thing you might need to pip install is tqdm – which if you’re not already using we heartily recommend – it creates the nice little loading bars. If you’re new to Python and the script errors when you run it, mentioning tqdm, simply type:

pip install tqdm (on Windows)

sudo pip install tqdm (on Mac)

You’ll only ever need to do that once.

Save it in a folder, navigate to that folder using command prompt or terminal and then run it the same way you’d run any Python script (typically ‘python <nameoffile.py>’). It takes three inputs:

The name of the first crawl_overview file
The name of the second crawl_overview file
The name of the file you’d like to save the output as – it should be a CSV, but doesn’t need to already exist

Both files should be in the same folder as the Python script and so a valid input would look something like this:

python crawl_comparison.py crawl_overview1.csv crawl_overview2.csv output.csv
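
If you only have Python 3 to hand, the core of the comparison can be sketched there too. What follows is a minimal sketch rather than a port of the script above: it assumes the same layout the script relies on (site name in the second row, five header rows, the first 191 rows of interest), and the names read_rows, cell and compare_overviews are made up for illustration. It leaves out the progress bars and the CSV output.

```python
import csv


def read_rows(path):
    # newline="" is the mode the Python 3 csv documentation recommends.
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def cell(row, index):
    # Short rows (for example blank separator lines) yield an empty string.
    return row[index] if len(row) > index else ""


def compare_overviews(path_one, path_two, header_rows=5, end_line=191):
    one = read_rows(path_one)[:end_line]
    two = read_rows(path_two)[:end_line]

    # As in the script above, the site name is read from row 2, column 2.
    site_one, site_two = cell(one[1], 1), cell(two[1], 1)

    # Drop the header rows, then pair up the values by position.
    one, two = one[header_rows:], two[header_rows:]
    labels = [cell(row, 0) for row in one]
    values_one = [cell(row, 1) for row in one]
    values_two = [cell(row, 1) for row in two]

    combined = list(zip(labels, values_one, values_two))
    return site_one, site_two, combined, values_one == values_two
```

Calling compare_overviews('crawl_overview1.csv', 'crawl_overview2.csv') would return both site names, the paired rows and a flag saying whether the two value columns match. Because rows are paired by position, the sketch (like the script) assumes both files list their rows in the same order.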

The script’s pretty fast – it’ll chew through the files within seconds and then report that either ‘Crawl files are identical’ or ‘Crawl files are NOT identical’. It will have saved a file under the output name you supplied (‘output.csv’ in the example above) in the same directory, comparing both crawl files – ready for you to:

Send onwards as proof as to whether recommendations have or haven’t been implemented; or
Create industry comparison graphs to show how the sites compare; or
Do with as you please.
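
If you only care about what changed, the output file can be filtered down to the rows where the two crawls disagree. As a rough sketch (written for Python 3, and assuming the three-column layout the script writes: a label, then one value per crawl), something like the following would do it. The function name changed_rows is hypothetical.

```python
import csv


def changed_rows(comparison_path):
    with open(comparison_path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return [], []

    # The first row holds the two site names; the rest are label/value/value.
    header, body = rows[0], rows[1:]

    # Skip blank or short rows, and keep only rows whose two values differ.
    changed = [row for row in body if len(row) >= 3 and row[1] != row[2]]
    return header, changed
```

Running changed_rows('output.csv') would return the header row plus only the metrics that differ between the two crawls, which is usually a much shorter list to send on.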

Future Updates

Now that the script is publicly available there are a few changes we plan to make to it. These include:

Creating a front-end and installer for those who don’t like to mess around with Python
Allowing for the comparison of multiple crawl_overview files at once
Allowing for the comparison of other Screaming Frog outputs – not just crawl_overview files.

We’d love your feedback as to what features you’d like to see added.

Verve Search Introduces: The LinkScore Tool

Want to understand the real value of the links you’re building?
Here at Verve Search, for the past five years, we’ve been developing a proprietary metric to do just that. Up to now, we’ve kept it exclusively for our clients but, in the interests of transparency and for the benefit of the industry as a whole, at the back end of last year we took the decision to build a free-to-use public version.

After a whole bunch of work, we’re delighted to say it’s live and ready for you to play with.

For those of you who just want to get their mitts on it, it’s right here.

For those who want to learn more about how it was developed, read on!

Why did we build the LinkScore tool?

A single metric might not always tell the full story!

We’d always felt that there was probably little point (from a rankings perspective) in having a link on an amazingly authoritative domain if it’s no-followed and in a language that neither you nor your customers speak. Yet, if you use a single metric to determine the authority of a link you may find that you’d be treating those links as if they were of equal value.

As such, rather than use a single metric, our tool blends more than 10 different on- and off-site metrics in order to assign a value to a link.

We needed an international metric…
We found that many SEO tools that assign link metrics are primarily focussed on English-speaking audiences. So, whilst their metrics might work well in primarily English-speaking countries, that might not always be the case in countries where English is not the native language. Therefore, we built the LinkScore to provide scores that give equal value to equivalent authoritative sites in each country – meaning quality links in one country are assigned an appropriate value.

We wanted a tool which could evolve & keep pace with the industry!
Each of the different variables added into the LinkScore was chosen based on our own testing and benchmarking. Where we’re using third-party metrics, we felt it was important that we weren’t tied to one particular database, and as a result we’ve been able to choose multiple best-in-class metrics that get us as close as possible to measuring the true ranking value of a link. Over the years the LinkScore tool has continually evolved alongside this fast-paced industry.

What does the LinkScore tool do?
It allows you to measure a link’s ability to influence rankings. It also allows links to be compared with each other, and groups of links to be compared periodically. Please note, we built this as an SEO tool, and as such, the tool does not take into account the value a link might provide in terms of PR, branding or any other type of marketing.

When you run a link through the tool, a score between 0 and 500 will be returned. This scale is not logarithmic; however, some of the variables used to calculate the score are.

Semantically relevant, followed, in-content links in unique content on authoritative domains yield the highest scores. Example sites which would yield high scores include the BBC and the New York Times.

How are scores calculated?

We could tell you, but we’d have to kill you! Kidding 🙂

We keep the exact metrics, and how they are combined, a closely guarded secret. This is to stop people gaming the algorithm, because it is updated annually and because we think the accuracy of the final scores speaks more to the quality of the LinkScore than any particular one of its metrics.

What do I need to do to use the tool?
You’ll need to input your Majestic, Dandelion and SEMRush API credentials and you’re good to go. Why? Well, the LinkScore uses metrics from each of these providers as part of its algorithm. To prevent abuse of the tool, we require users to use their own API accounts rather than providing free access to our own. Rest assured, your API credentials are stored locally on your computer; we do not keep a copy of your API credentials, nor do we use them for any purpose other than analysing the links you add to the LinkScore tool.

How much does the LinkScore tool cost?
Except to the extent that it uses your third-party API credits, the LinkScore tool is free to use. Rate limiting may, however, be put in place to maintain the experience for all users.

Do we store your data?
Definitely not! We do not store your API credentials, the links that you run through the LinkScore tool, or the score output. However, we do run Google Analytics and so store a number of different metrics related to your visit including, but not limited to, your location, browser, time on site and pages visited.

Got more questions? Check out our FAQ, or contact us.

And if you do play with the tool, do let us know what you think.
