Getting a large list of URLs for testing using Squid logs
My master's project involves testing a large number of web servers, so I needed a large list of them: at least 100,000. I got such a list by applying as a researcher at http://www.ircache.net. They offer (free for researchers, paid for others) trace files from their public Squid proxy servers situated in the US. What this means is that when people use their free proxy servers, every URL they visit is logged in anonymized form. In fact, the anonymization process is fairly interesting in that they MD5-hash all the POST/GET variables and assign a random but consistent IP address to each client…but enough about that and back to the task at hand.
Once you get an account for the trace files (read the FAQ and email them), you can download all the trace files. They will recommend you use the command-line Linux “ftp” client, but I’d recommend the command-line Linux “ncftp” program instead. This will allow you to download a whole directory at once rather than fetching each file one by one.
So:
- mkdir traces;cd traces
- ncftp -u youruser -p yourpassword ftp.ircache.net
- get Traces/*
- quit
- gunzip *.gz
And here is the voodoo magic command I wrote that turns all those trace files into a single sorted list of unique web URLs (scheme and host name only, with no paths or POST/GET data):
- cat * | grep 200 | grep http:// | awk '{ print $7 }' | awk -F/ '{ print $1 "//" $3 }' | sort | uniq > urls.txt
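One caveat with a pipeline like this: "grep 200" matches the string 200 anywhere on a line (a byte count, an elapsed time, part of a timestamp), not just the HTTP status. The following is a minimal sketch of a stricter variant, assuming the traces use Squid's native access.log layout, where field 4 looks like TCP_MISS/200 and field 7 is the URL. The sample log lines, the file name sample.log, and the function name extract_urls are all made up for illustration.

```shell
# Hypothetical sample lines in Squid's native access.log layout
# (field 4 = result/status, field 7 = URL); every value here is invented.
cat > sample.log <<'EOF'
1190000000.123    200 10.0.0.1 TCP_MISS/404 512 GET http://example.org/missing - DIRECT/example.org text/html
1190000001.456     85 10.0.0.2 TCP_MISS/200 2048 GET http://www.example.com/index.html - DIRECT/www.example.com text/html
1190000002.789     90 10.0.0.3 TCP_HIT/200 1024 GET http://www.example.com/about.html - NONE/- text/html
EOF

# Keep only rows whose status code is exactly 200 and whose URL starts with
# http://, reduce each URL to scheme://host, then sort and de-duplicate.
extract_urls() {
  awk '$4 ~ /\/200$/ && $7 ~ /^http:\/\// {
         split($7, parts, "/")          # parts[1]="http:", parts[3]=host
         print parts[1] "//" parts[3]
       }' "$@" | sort -u
}

extract_urls sample.log
```

The first sample line has an elapsed time of 200 but a 404 status, so a plain "grep 200" would keep it while the awk test on field 4 drops it. To run this against the real traces, something like "extract_urls * > urls.txt" from the traces directory should work, assuming the field positions above hold for your files.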
The generated urls.txt file (it takes a LONG time) contains one URL per line, such as http://www.google.com. In total, this gave me about 200,000 unique URLs. IRCache.net only keeps one week's worth of data on their server at a time, so by downloading new traces each day over a longer period, you could acquire an even larger number of URLs fairly quickly.
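If you do collect traces over several days, each day's list has to be folded into the running total without reintroducing duplicates. Here is a rough sketch of one way to do that with "sort -u". The function name merge_urls and the file names all_urls.txt and urls_today.txt are hypothetical, and the sample URLs are placeholders.

```shell
# merge_urls MASTER NEW: fold the URLs in NEW into MASTER, keeping MASTER
# sorted and free of duplicates. Both file names are hypothetical.
merge_urls() {
  master=$1
  new=$2
  touch "$master"   # lets the first run work before MASTER exists
  # Write to a temp file first: sort must not read and overwrite MASTER at once.
  sort -u "$master" "$new" > "$master.tmp" && mv "$master.tmp" "$master"
}

# Example with made-up data:
printf '%s\n' http://a.example http://b.example > all_urls.txt
printf '%s\n' http://b.example http://c.example > urls_today.txt
merge_urls all_urls.txt urls_today.txt
cat all_urls.txt
```

After the merge, all_urls.txt holds each of the three hosts exactly once. Running the extraction on only the new day's traces and merging afterwards also avoids re-processing the older files every time.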