Getting a large list of URLs for testing using Squid logs


My master's project involves testing a large number of web servers, which means I need a large list of them: at least 100,000. I got such a list by applying as a researcher at http://www.ircache.net. They offer trace files (free for researchers, paid for others) from their public Squid proxy servers located in the US. In other words, when people use their free proxy servers, the URLs they visit are logged anonymously. The anonymization process is fairly interesting in itself: they MD5-hash all the POST/GET variables and assign each client a random but consistent IP address. But enough about that and back to the task at hand.

Once you get an account for the trace files (read the FAQ and email them), you can download all the trace files. They recommend the command-line Linux “ftp” client, but I’d recommend the command-line “ncftp” program instead. It lets you download a whole directory at once rather than each file one by one.

So:

  • mkdir traces;cd traces
  • ncftp -u youruser -p yourpassword ftp.ircache.net
  • get Traces/*
  • quit
  • gunzip *.gz

And here is the voodoo magic command I wrote that turns all those trace files into a single sorted list of unique web URLs (reduced to scheme and host, so no paths or POST/GET data):

  • cat * | grep 200 | grep http:// | awk '{ print $7 }' | awk -F/ '{ print $1 "//" $3 }' | sort | uniq > urls.txt
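One caveat with the pipeline above: `grep 200` matches those digits anywhere on the line (timestamps, byte counts), not just the status code. Here is a stricter variant as a rough sketch. It assumes the traces use Squid's native access.log layout, where field 4 looks like TCP_MISS/200 and field 7 is the URL; the function name extract_hosts is just something I made up for illustration.

```shell
# extract_hosts: read Squid native-format log lines on stdin and print the
# unique scheme://host prefix of every http:// request that returned 200.
# Assumption: field 4 is "ACTION/STATUS" and field 7 is the requested URL.
extract_hosts() {
    awk '$4 ~ /\/200$/ && $7 ~ /^http:\/\// {
        # Splitting "http://host/path" on "/" gives
        # parts[1]="http:", parts[2]="", parts[3]="host"
        split($7, parts, "/")
        print parts[1] "//" parts[3]
    }' | sort -u
}

# Usage (run inside the traces directory):
#   cat * | extract_hosts > urls.txt
```

Doing the filtering in a single awk call also saves a few passes over the data, and `sort -u` does the job of `sort | uniq` in one step.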

The urls.txt file which is generated (after a LONG time) contains a single URL, such as http://www.google.com, per line. In total, this gave me about 200,000 unique URLs. IRCache.net only keeps one week's worth of data on their server at a time, so by downloading new traces each day over a longer period, you could acquire an even larger number of URLs fairly quickly.
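If you do collect traces over several days, the per-day lists need merging at some point. A minimal sketch, assuming each day's output was saved under a name like urls-DAY.txt (that naming scheme and the function name merge_url_lists are hypothetical, not something IRCache provides):

```shell
# merge_url_lists: combine any number of per-day URL lists (given as
# arguments) into one sorted, de-duplicated list on stdout.
merge_url_lists() {
    sort -u "$@"
}

# Usage:
#   merge_url_lists urls-*.txt > all-urls.txt
```

The output name deliberately does not match the urls-*.txt pattern, so re-running the merge never reads its own result as input.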
