Robots.txt
This project measures how widely the robots exclusion standard
(robots.txt) is actively used on the Internet. We then apply the
measured figures to the performance of client-side and proxy-side
web prefetchers.
Documents
- Progress Report I - doc
- Progress Report II - ppt
- Final Paper - doc
- Final Presentation - ppt
Data
Tools
The following scripts were used for the data collection and
analysis phases of the project. Note that most are quick-and-dirty
Perl scripts, so they may or may not be understandable or useful.
- rdf2txt.pl - takes the raw RDF database data (no structure data) and converts it to a newline-delimited list of websites. Category information is preserved as "comments".
- filtertxt.pl - filters the text file produced by rdf2txt.pl. Strips out obviously foreign websites and specific web pages (as opposed to websites). It also ensures that every website ends with a '/', which the RDF data does not guarantee.
- zapduplicates.pl - filters the data provided by filtertxt.pl to eliminate any duplicate websites. Technically, this should have been part of filtertxt.pl.
- splitwork.pl - splits workload (websites) evenly across a specified number of files. Use this script if you will be running multiple robots in parallel.
- robot.pl - the robot that does all the data gathering. It operates in two modes: preliminary scanning (is the website up? does it potentially have a robots.txt file?) and download (if a robots.txt file is found on the website, download it for later analysis).
- analyze.pl - calculates useful information from the output of the robot.
- aggregate.pl - calculates useful information from the robots.txt files downloaded by the robot (assumed to all be in one download directory).
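
The list-preparation steps described above (filtertxt.pl, zapduplicates.pl and splitwork.pl) can be sketched roughly as follows. This is a minimal Python illustration, not the original Perl code. The function names are hypothetical, and two details are assumptions: category "comments" are taken to start with '#', and any URL with a path or query string is treated as a specific web page. The foreign-website filter is left out because its criteria are not described here.

```python
from urllib.parse import urlsplit


def normalize_site(url):
    """Return the site root with a trailing '/', or None for a specific page."""
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.netloc:
        return None  # not an absolute URL
    if parts.path not in ("", "/") or parts.query:
        return None  # assumption: any path or query marks a page, not a site
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def prepare_sites(lines):
    """Skip comments and pages, normalize sites, drop duplicates in order."""
    seen = set()
    sites = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # assumption: category "comments" start with '#'
        site = normalize_site(line)
        if site is not None and site not in seen:
            seen.add(site)
            sites.append(site)
    return sites


def split_work(sites, n):
    """Deal sites round-robin into n lists whose sizes differ by at most one."""
    return [sites[i::n] for i in range(n)]
```

Each list returned by split_work could then be written to its own file and handed to one robot instance. Dealing the sites out round-robin keeps the per-robot workloads within one site of each other.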