Sep
29
2007
I needed a list of websites of all the Fortune 500 from 2007 for my master’s project. Unfortunately, Fortune wanted to charge me hundreds of dollars to get some fancy Excel spreadsheet with much more information than I really needed. I suspect there are other people out there who might find this list useful, so I’ll share how I made it (in case you want more than the URLs). However, you can skip all the following and just download The 2007 Fortune 1000 Website List. Also, here is a zip file of the 2007 Fortune 1000 HTML files found at money.cnn.com.
Using wget, you can download each of the 1000 company pages. The page numbers in the URLs seem to be slightly random, but they all fall between 1.html and 5000.html. Thankfully, wget does not save 404 error pages. So, we download all the links in an empty directory:
- for i in `seq 1 5000`; do wget http://money.cnn.com/magazines/fortune/fortune500/2007/snapshots/$i.html ; done
Ah, that’s nice - we somehow have some extras. After some analysis, this will fix the problem:
- for i in `fgrep xxxxx *|awk '{ print $1 }'|awk -F ":" '{ print $1 }'|sort|uniq`;do rm $i;done
Woohoo! 1000 html files - perfect!
- cat *.html|grep Website|grep headersmtext|awk '{print $4}'|awk -F "\"" '{ print $2 }'|sort|uniq > output.txt
Oddly, you’ll only end up with 996 unique URLs because the following four each appear twice (which is correct):
- http://www.cvscaremark.com
- http://www.fcx.com
- http://www.integrysgroup.com
- http://www.oshkoshtruckcorporation.com
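To see which URLs account for the gap between 1000 files and 996 unique lines, one option is to ask `uniq -d` for the repeated entries instead of collapsing them. A minimal sketch follows; the helper name `list_duplicates` and the sample URLs are made up for illustration and are not the real data:

```shell
# list_duplicates: print each line that occurs more than once on stdin.
# (Hypothetical helper name. The sort is needed because uniq only
# compares adjacent lines.)
list_duplicates() {
    sort | uniq -d
}

# Small invented sample standing in for the un-deduplicated URL list.
printf '%s\n' \
    'http://www.example.com' \
    'http://www.example.org' \
    'http://www.example.com' | list_duplicates
```

Feeding it the extraction pipeline without the final `sort|uniq` should print the duplicated URLs.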
2 comments | posted in Projects, School
Sep
14
2007
I had trouble compiling packit today, a network auditing tool that allows one to define (spoof) nearly all TCP, UDP, ICMP, IP, ARP, RARP, and Ethernet header options to test firewalls, intrusion detection/prevention systems, port scanning, simulating network traffic, and general TCP/IP auditing. I’m running OpenSUSE 10.1 and the compilation failed on a #include <net/bpf.h>.
I found a solution to this problem on Jeff Terrell’s site, but here it is in a nutshell. Assuming you have libpcap already installed, all you should have to do is:
- cp /usr/include/pcap-bpf.h /usr/include/net/bpf.h
Alternatively, edit the file containing the failing #include and point it at the correct location.
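The alternative could be done with sed. This is only a sketch, assuming the failing file contains the literal line `#include <net/bpf.h>`; the helper name `fix_bpf_include` and the demo file are invented for illustration and are not part of packit:

```shell
# fix_bpf_include FILE: rewrite '#include <net/bpf.h>' to
# '#include <pcap-bpf.h>' and print the result on stdout.
# The file itself is left untouched.
fix_bpf_include() {
    sed 's|#include <net/bpf.h>|#include <pcap-bpf.h>|' "$1"
}

# Demo on a throwaway file standing in for the real source file.
demo=$(mktemp)
printf '%s\n' '#include <stdio.h>' '#include <net/bpf.h>' > "$demo"
fix_bpf_include "$demo"
rm -f "$demo"
```

Once the output looks right, redirect it to a new file and swap that in for the original.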
1 comment | posted in Ideas, Projects
Sep
11
2007
My master’s project involves testing a large number of webservers. To do that I would obviously need a large list of them - at least 100,000. I got such a list by applying as a researcher with http://www.ircache.net. They offer (free for researchers, pay for others) trace files from their Squid public proxy servers situated in the US. What this means is that when people use their free proxy servers, they anonymously log all the URLs people visit. In fact, the anonymization process is fairly interesting in that they MD5-hash all the POST/GET variables and assign a random but consistent IP address to the client…but enough about that and back to the task at hand.
Once you get an account for the trace files (read the FAQ and email them), you can download all the trace files. They will recommend you use the command line Linux “ftp” client, but I’d recommend downloading the command line Linux “ncftp” program. This will allow you to download a whole directory at once rather than each file one-by-one.
So:
- mkdir traces;cd traces
- ncftp -u youruser -p yourpassword ftp.ircache.net
- get Traces/*
- quit
- gunzip *.gz
And here is the voodoo magic command I wrote that turns all those trace files into a single list of unique, ordered, web urls (with no POST/GET data):
- cat *|grep 200 |grep http://|awk '{ print $7 }'|awk -F/ '{ print $1 "//" $3 }'|sort|uniq > urls.txt
The urls.txt file which is generated (after a LONG time) contains a single URL, such as http://www.google.com, per line. In total, this gave me about 200,000 unique URLs. IRCache.net only keeps one week’s worth of data on their server at a time, so by downloading new traces each day over a longer period of time, you could acquire an even larger number of URLs fairly quickly.
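One caveat: `grep 200` matches any line containing "200" anywhere, including inside a timestamp or byte count. A stricter variant is sketched below. It assumes the traces follow Squid's native access.log layout, where field 4 holds the result code and HTTP status (for example `TCP_MISS/200`) and field 7 holds the URL, which is consistent with the `$7` used above. The function name `extract_hosts` and the sample log lines are invented for illustration:

```shell
# extract_hosts: read Squid-style log lines on stdin and print the unique,
# sorted scheme://host prefixes of http:// requests whose status is 200.
extract_hosts() {
    awk '$4 ~ /\/200$/ && $7 ~ /^http:\/\// {
             # part[1] = "http:", part[2] = "", part[3] = host
             split($7, part, "/")
             print part[1] "//" part[3]
         }' | sort -u
}

# Invented sample lines in the assumed format. The second line is a 404
# whose timestamp happens to contain "200".
printf '%s\n' \
    '1189500000.123 45 10.0.0.1 TCP_MISS/200 1024 GET http://www.example.com/a?x=1 - DIRECT/- text/html' \
    '1189500001.200 12 10.0.0.2 TCP_MISS/404 512 GET http://www.example.org/missing - DIRECT/- text/html' \
    '1189500002.300 30 10.0.0.1 TCP_HIT/200 2048 GET http://www.example.com/b - NONE/- text/html' |
    extract_hosts
```

Doing the filtering in a single awk pass also avoids two of the extra processes in the pipeline, which may help a little on large trace files.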
no comments | posted in Ideas, Projects
Sep
11
2007
Part of my master’s project work involves using network measurement tools to garner information about a path to a website. One useful type of tool that I don’t believe is used that often is a bandwidth estimation tool. These types of tools employ one of a variety of methods to estimate the available bandwidth between each TTL hop along a given router path to a host. To learn more about these tools, including clink, I recommend reading “Creating a Bandwidth Estimation Testbed Summer 2001 Status Report.”
One of these tools, clink, was written by Allen Downey and makes significant improvements over Van Jacobson’s similar tool, pathchar. Unfortunately, I noticed a problem where clink seemed to hang on certain hosts. I don’t believe I am alone in reporting this problem. In “Measuring Bandwidth between PlanetLab Nodes” (PDF Link), published in the proceedings of PAM 2005 – Passive & Active Measurement Workshop, the researchers noticed that clink would hang on PlanetLab’s machines and attributed the hang to a possible Linux kernel version problem. It is possible the kernel is the cause, but I ran into another situation where clink appeared to hang, which might explain their hang as well.
When clink experiences a timeout on a probe to a TTL hop, it simply retries the probe. Of course, if the router has been set up not to respond to UDP packets, as many routers on today’s Internet are, clink will endlessly probe the router with no success. To the end user, this looks like a hang, but a tcpdump will confirm clink is still firing off the same UDP probe packet over and over. When clink was written in 1998-99, many routers were configured to (nicely) respond to a probe, but this is no longer the case.
Because I found clink’s bandwidth estimation using the even-odd technique, as described in the SIGCOMM paper, to be the best available, I rewrote part of the code to fix the infinite loop bug caused by router timeouts. I introduced two new program arguments. The first is a maximum number of retries per probe and the second is a maximum number of probe failures per TTL hop. You can therefore retry a probe of a specific size against a specific TTL hop up to the first limit before declaring the probe a failure. Then, if the number of probe failures on a specific TTL hop exceeds the second argument, the TTL hop is simply marked as failed and skipped. Clink then goes on to measure the rest of the hops.
I am not publishing the code patches yet as I am still testing them, but if you are interested in taking a peek, please comment and I’ll email you a copy.
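The two limits can be illustrated with a small shell simulation. This is not the actual patch, only a sketch of the control flow described above: the function names, the probe sizes, and the stubbed `send_probe` (which pretends hop 2 never answers) are all invented for illustration.

```shell
# send_probe TTL SIZE: stand-in for a real UDP probe.
# In this simulation, hop 2 never answers and every other hop always does.
send_probe() {
    [ "$1" -ne 2 ]
}

# probe_with_retry TTL SIZE MAX_RETRIES: try the same probe up to
# MAX_RETRIES times; succeed as soon as one attempt gets a reply.
probe_with_retry() {
    attempt=0
    while [ "$attempt" -lt "$3" ]; do
        send_probe "$1" "$2" && return 0
        attempt=$((attempt + 1))
    done
    return 1    # this probe is declared a failure
}

# measure_hop TTL MAX_RETRIES MAX_FAILURES: probe the hop with several
# sizes and give up once more than MAX_FAILURES probes have failed.
measure_hop() {
    failures=0
    for size in 64 128 256 512; do
        if ! probe_with_retry "$1" "$size" "$2"; then
            failures=$((failures + 1))
            if [ "$failures" -gt "$3" ]; then
                echo "hop $1: failed, skipping"
                return 1
            fi
        fi
    done
    echo "hop $1: measured"
}

# Walk three hops with at most 3 tries per probe and 2 tolerated failures.
for ttl in 1 2 3; do
    measure_hop "$ttl" 3 2 || true
done
```

The point of the two separate limits is that a single lost packet only costs a retry, while a router that never answers is abandoned after a bounded number of probes instead of stalling the whole run.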
3 comments | posted in Ideas, Projects, School
Sep
11
2007
After spending four months in India, London was a night and day difference. The air was not clean but rather was filled with the scent of western cologne. I can’t say I enjoyed it any more than the dung and garbage lots I found in India, but at least this scent was sanitary. In fact, sanitation was really a novelty to me. The hotel I stayed at with my parents was…spotless. Of course, at the price I was paying, one could nearly construct a new hotel in India (no, but it really is expensive). The cab ride from the airport to the hotel alone was over 60 times more expensive than in India. It just blew my mind.
Once you get past the price, London is just like any other city. We went shopping, saw Spamalot (hilarious), and did the usual tourist attractions. I really enjoyed the British Museum and our day trips out to see Stonehenge, Bath, Oxford, and Stratford. I spent the week catching up on all the sleep I had lost as well as soaking in the luxuries I had nearly forgotten about. These included constant electricity, fast internet, temperature controlled rooms, hot showers with shower heads, toilet paper (oh, how I missed it), clean clothes, and mattresses with clean sheets. To be honest, all of these luxuries are possible and certainly cheaper in India, but were out of my price range while trying to live off of my Indian salary. That last part is the key difference.
And, while I have left out many stories, that concluded my summer. I am now back in Cleveland toiling endlessly away on my master’s project during my last semester. I act like life is hard, but I know it isn’t. I have embraced the lifestyle I all but abandoned while in India, but certain things have remained. I have been cooking a lot more Indian dishes now that I know how they are supposed to taste. I notice Indian people a lot more (and I sometimes will interject a “shukriya” in conversation to see if they notice). I am applying to full-time positions, but otherwise, life is quiet and a bit lonely. Alvida, summer.
no comments | posted in Life, Travel