
Comparing Screaming Frog Crawl Files

James Finlayson

May 21, 2018

Handing over technical recommendations often comes with some trepidation: how long will they take to be implemented, will they be implemented at all and, if so, will they be implemented correctly? That’s why understanding how development cycles work, how items are prioritised and who you need to get onside is as key to successful technical SEO as the recommendations themselves. However well you understand those, though, changes are often implemented without anyone telling you that they’re complete.

It’s for that reason that tools like ContentKing have sprung up: to keep an eye on the site and alert you to changes. It’s not always feasible to run SaaS crawlers on the site, though. As a result, many of us rely on running crawls with Screaming Frog’s crawler. Comparing crawl files can be a pain, however. Usually, you’ll end up dumping the data into Excel and running a bunch of VLOOKUP or INDEX/MATCH functions, only to find that no, the developer hasn’t implemented the changes.

Meanwhile, you’ll occasionally want to compare crawl files of different sites to:

  1. Compare a dev environment with a staging environment
  2. Make sure content has been ported to a new site correctly
  3. Run technical SEO competitive analysis/comparisons – we wrote about this recently here.

This has always been a pain, which is why, for a while now, we’ve had a tool that quickly compares crawl_overview files for us. Today, we’re making it available for free.

It’s a simple Python script. If you don’t have Python installed, you can read a guide for Windows here and for macOS here (you’ll need Python 2, rather than 3, for the script to work – though feel free to install both using virtual environments if you’re really keen on 3). The script itself is here:

import csv
import sys

from tqdm import tqdm


class color:
   # ANSI escape sequences for coloured terminal output
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


def main(argv):
	if len(argv) != 4:
		print 'Usage: programname.py crawl_overview1.csv crawl_overview2.csv output.csv'
		sys.exit()

	headerrows = 5
	endline = 191

	fileone = get_csv(argv[1])
	filetwo = get_csv(argv[2])

	fileone = fileone[0:endline]
	filetwo = filetwo[0:endline]

	fileonesite = fileone[1][1]
	filetwosite = filetwo[1][1]

	fileone = fileone[headerrows:]
	filetwo = filetwo[headerrows:]

	fileonedata = []
	filetwodata = []
	combineddata = []
	firstcolumn = []

	firstcolumn.extend(get_column(fileone,0))
	fileonedata.extend(get_column(fileone,1))
	filetwodata.extend(get_column(filetwo,1))
	combineddata.extend(zip(firstcolumn,fileonedata,filetwodata))


	# Write the combined table; the with-block closes the file when done
	with open(argv[3], 'w') as outhandle:
		outFile = csv.writer(outhandle)
		outFile.writerow(["",fileonesite,filetwosite])
		for i in tqdm(combineddata):
			outFile.writerow(i)

	if fileonedata == filetwodata:
		print (color.BOLD + color.RED + "Crawl files are identical" + color.END)
	else:
		print (color.BOLD + color.GREEN + "Crawl files are NOT identical" + color.END)

def get_csv(thefile):
	datafile = open(thefile, 'r')
	datareader = csv.reader(datafile, delimiter=",")
	data=[]
	for row in tqdm(datareader):
		data.append(row)
	datafile.close()
	return data

def get_column(thelist,thecolumn):
	newlist =[]
	for row in tqdm(thelist):
		if len(row) >= thecolumn +1:
			newlist.append(row[thecolumn])
		else:
			newlist.append("")
	return newlist

if __name__ == '__main__':
  main(sys.argv)

The only thing you might need to pip install is tqdm – which if you’re not already using we heartily recommend – it creates the nice little loading bars. If you’re new to Python and the script errors when you run it, mentioning tqdm, simply type:

pip install tqdm (on Windows)

sudo pip install tqdm (on Mac)

You’ll only ever need to do that once.

Save it in a folder, navigate to that folder using command prompt or terminal and then run it the same way you’d run any Python script (typically ‘python <nameoffile.py>’). It takes three inputs:

  1. The name of the first crawl_overview file
  2. The name of the second crawl_overview file
  3. The name of the file you’d like to save the output as – it should be a csv, but doesn’t need to already exist

Both files should be in the same folder as the Python script and so a valid input would look something like this:

python crawl_comparison.py crawl_overview1.csv crawl_overview2.csv output.csv

The script’s pretty fast – it’ll chew through the files within seconds and then report that either ‘Crawl files are identical’ or ‘Crawl files are NOT identical’. It will have saved a file under the output name you gave it (‘output.csv’ in the example above) in the same directory, comparing both crawl files – ready for you to:

  1. Send onwards as proof as to whether recommendations have or haven’t been implemented; or
  2. Create industry comparison graphs to show how the sites compare; or
  3. Do with as you please.
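
If you only want the rows that changed, the output file is easy to filter. The following is a minimal sketch rather than part of the script: it assumes Python 3, assumes the output layout described above (a header row of site names, then one label and two values per row), and the function name differing_rows and the sample figures are made up for illustration.

```python
import csv
import io


def differing_rows(lines):
    """Return (label, first_value, second_value) for rows whose values differ.

    `lines` is any iterable of CSV text lines, such as an open file object.
    The first row is assumed to be the header the comparison script writes:
    an empty cell followed by the two site names.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row of site names
    changed = []
    for row in reader:
        # Pad short or blank rows so unpacking never fails.
        label, first, second = (row + ["", "", ""])[:3]
        if first != second:
            changed.append((label, first, second))
    return changed


# Made-up sample data standing in for the contents of output.csv
sample = io.StringIO(
    ",https://example.com,https://staging.example.com\n"
    "Total URLs Encountered,120,118\n"
    "Missing H1,4,4\n"
    "Canonicals: Missing,10,0\n"
)
for label, first, second in differing_rows(sample):
    print("{0}: {1} -> {2}".format(label, first, second))
```

To run it against a real file, pass an open file object instead of the sample, for example open('output.csv', newline='').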

Future Updates

Now that the script is publicly available there are a few changes we plan to make to it. These include:

  1. Creating a front-end and installer for those who don’t like to mess around with Python
  2. Allowing for the comparison of multiple crawl_overview files at once
  3. Allowing for the comparison of other Screaming Frog outputs – not just crawl_overview files.
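
As a rough idea of how the second item might work, here is a sketch that lines up any number of crawl_overview files side by side. It is an illustration under assumptions, not the planned implementation: it reuses the same header offset, cut-off line and site-name cell as the script above, matches rows by position just as that script does, and the names read_rows and combine_overviews and the sample data are hypothetical.

```python
import csv

HEADER_ROWS = 5   # same offset the script above uses
END_LINE = 191    # same cut-off the script above uses


def read_rows(path):
    """Read a crawl_overview CSV into a list of rows."""
    with open(path, "r") as handle:
        return list(csv.reader(handle))


def combine_overviews(crawls):
    """Build one table from several parsed crawl_overview files.

    `crawls` is a list where each item is the list of rows from one file.
    Rows are matched by position, so the files are assumed to share a layout.
    """
    trimmed = [rows[:END_LINE] for rows in crawls]
    sites = [rows[1][1] for rows in trimmed]           # site name cell
    bodies = [rows[HEADER_ROWS:] for rows in trimmed]  # drop header rows
    labels = [row[0] if row else "" for row in bodies[0]]

    table = [[""] + sites]
    for index, label in enumerate(labels):
        line = [label]
        for body in bodies:
            row = body[index] if index < len(body) else []
            line.append(row[1] if len(row) > 1 else "")
        table.append(line)
    return table


def fake_crawl(site, first, second):
    """Made-up stand-in for a parsed crawl_overview file."""
    return [["Crawl Overview"], ["Site Crawled", site], [], [], [],
            ["Metric A", first], [], ["Metric B", second]]


combined = combine_overviews([
    fake_crawl("https://a.example", "1", "2"),
    fake_crawl("https://b.example", "1", "3"),
    fake_crawl("https://c.example", "0", "2"),
])
for line in combined:
    print(line)
```

With real files you would pass [read_rows(name) for name in filenames] to combine_overviews and write the resulting table out with csv.writer.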

We’d love your feedback as to what features you’d like to see added.
