The 2007 Fortune 1000 Website List
I needed a list of the websites of all the 2007 Fortune 1000 companies for my master’s project. Unfortunately, Fortune wanted to charge me hundreds of dollars for a fancy Excel spreadsheet with much more information than I really needed. I suspect other people might find this list useful, so I’ll share how I made it (in case you want more than the URLs). If you only want the result, you can skip everything below and just download The 2007 Fortune 1000 Website List. Also, here is a zip file of the 2007 Fortune 1000 HTML files found at money.cnn.com.
Using wget, you can download each of the 1000 HTML files. The page numbers in the URLs seem to be slightly random, but they all fall between 1.html and 5000.html. Thankfully, wget does not save 404 error pages. So, we download all the links into an empty directory:
- for i in `seq 1 5000`; do wget http://money.cnn.com/magazines/fortune/fortune500/2007/snapshots/$i.html ; done
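The same loop can be written as a pair of small functions, which makes it easy to fetch a smaller range first. This is only a sketch: the function names `snapshot_url` and `fetch_snapshots` are made up for illustration, and the one-second pause between requests is an assumption that is not part of the original command.

```shell
# Build the snapshot URL for a page number (URL pattern taken from the command above).
snapshot_url() {
    printf 'http://money.cnn.com/magazines/fortune/fortune500/2007/snapshots/%s.html\n' "$1"
}

# Fetch pages START..END into the current directory.
fetch_snapshots() {
    start=$1
    end=$2
    i=$start
    while [ "$i" -le "$end" ]; do
        # -q keeps wget quiet; on a 404 it exits non-zero and saves nothing,
        # so the failure is ignored and the loop moves on.
        wget -q "$(snapshot_url "$i")" || true
        sleep 1   # assumed politeness delay, not in the original one-liner
        i=$((i + 1))
    done
}
```

Running `fetch_snapshots 1 5000` in an empty directory would cover the same range as the one-liner; trying `fetch_snapshots 1 20` first is a quick way to check that the pages are still reachable.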
Ah, that’s nice - we somehow have some extras (more than 1000 files). After some analysis, this removes them:
- for i in `fgrep xxxxx *|awk '{ print $1 }'|awk -F ":" '{ print $1 }'|sort|uniq`;do rm $i;done
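A shorter way to do the same cleanup is to let grep print the matching file names directly. The sketch below assumes you know the literal string that marks an unwanted page (the `xxxxx` placeholder above); the function name `remove_matching` is hypothetical.

```shell
# Delete every .html file in the current directory that contains the literal string $1.
remove_matching() {
    pattern=$1
    # -l prints only the names of files that match;
    # -F treats the pattern as a fixed string rather than a regular expression.
    grep -lF -- "$pattern" ./*.html | while IFS= read -r f; do
        rm -- "$f"
    done
}
```

Running `grep -lF -- "$pattern" ./*.html` on its own first shows which files would be deleted, which is a cheap safety check before calling `remove_matching`.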
Woohoo! 1000 html files - perfect!
- cat *.html|grep Website|grep headersmtext|awk '{print $4}'|awk -F "\"" '{ print $2 }'|sort|uniq > output.txt
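For anyone adapting the one-liner above, here it is again with one stage per line and a comment on what each stage does. The stages are unchanged; only the wrapper function `extract_urls` is an invented name, and it still depends on the page layout the original command relies on.

```shell
# Same stages as the one-liner, applied to the HTML files given as arguments.
extract_urls() {
    cat -- "$@" |
        grep Website |                 # keep lines that mention "Website"
        grep headersmtext |            # ...and also contain "headersmtext"
        awk '{ print $4 }' |           # take the fourth whitespace-separated field
        awk -F '"' '{ print $2 }' |    # take the text between the first pair of double quotes
        sort -u                        # sort and drop duplicates (same as sort | uniq)
}
```

Usage would be `extract_urls *.html > output.txt`.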
Oddly, you’ll only end up with 996 unique URLs, because the following four each appear twice (which is correct):
- http://www.cvscaremark.com
- http://www.fcx.com
- http://www.integrysgroup.com
- http://www.oshkoshtruckcorporation.com
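To see which URLs are duplicated, you can run the extraction without the final `sort|uniq` and pass the result through `uniq -d`. A minimal sketch, assuming the undeduplicated list has been saved one URL per line to a file (the name `all-urls.txt` in the usage note and the function names are hypothetical):

```shell
# Print each line of stdin that occurs more than once, one copy per duplicated value.
duplicate_urls() {
    sort | uniq -d
}

# Print the total and distinct line counts for the file named in $1.
url_counts() {
    total=$(wc -l < "$1" | tr -d ' ')
    distinct=$(sort -u "$1" | wc -l | tr -d ' ')
    printf '%s total, %s unique\n' "$total" "$distinct"
}
```

For example, `duplicate_urls < all-urls.txt` lists the repeated URLs, and `url_counts all-urls.txt` shows how far the unique count falls short of the total.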
I too wanted the Fortune 1000 URLs for a little project and now I’ve got them. A few tweaks and I’ll have the company names too. Thanks for providing this starting point!