Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Tuesday, April 19, 2011

Code: a random web crawler

This code crawls the web to generate a pseudo-random sample of web sites. Not my prettiest code, but it works and may save somebody an afternoon of coding.

random_crawler.tar.gz (14MB, zipped)
This script explores the web for a pseudo-random sample of sites. The crawl proceeds in a series of 100 (by default) waves. In each wave, 2,000 (by default) crawlers attempt to download pages. When page downloads are successful, all the unique outbound hyperlinks are stored to a master list. Then sites for the next round are sampled.

Sites from common domain names are undersampled, to avoid getting stuck (e.g. within specific social networking sites). When sites are selected for the next round, weights are equal x^(3/4), where x is the number of sites in the same domain.

After several waves, the sample should be well-mixed -- a pseudo-random sample of the web.

Note: This code is kind of sloppy, does not scale well, and is poorly commented. Bad code! Bad!