Package comirva.web.crawling.agmis

Class Summary
CrawlListCreator Creates a list of all URLs to be crawled from the info.xml files.
CrawlListManager This class maintains the list of URLs to fetch.
DownloadControlData Class to hold the data structure to manage downloads (in particular, to enforce a minimum delay between successive queries to the same host address).
DownloadControlDataVector Extension to Vector to cope with special requirements of the DownloadControlData.
ExaleadRetriever This class automatically queries Exalead and stores the resulting URLs in text files.
GoldenRetriever This class retrieves the set of web pages listed in the crawl list generated by the class CrawlListCreator.
GoldenRetriever_ProcessedIndexCorrector This class analyzes the crawling.txt and writes the file processed_idx.txt, containing the indices (with respect to crawling.txt) of all URLs that have actually been retrieved, determined by checking whether the corresponding files exist on disk.
RetrievalData Class to hold the data structure for the retrieval of one URL into one file.
SearchResultsAnalyzer Analyzes the info.xml files stored for a crawl and prints out some statistical measures.
SubsetCollectionCreation_Linux Creates a subset of the complete artist collection's retrieved Web pages given a text file with the complete paths to the crawl dirs of each artist that should be included.
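
The DownloadControlData entry above mentions enforcing a minimum time between queries to the same host. The following is a minimal sketch of that idea only, not the package's actual implementation: the class name HostDelayTracker, its methods, and the use of a HashMap keyed by host name are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of comirva): remembers when each host was
 * last queried and reports how long a caller should still wait.
 */
public class HostDelayTracker {
    private final long minDelayMillis;
    // Maps a host name to the timestamp (ms) of the last recorded access.
    private final Map<String, Long> lastAccess = new HashMap<>();

    public HostDelayTracker(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Extracts the host part of a URL; falls back to the full string
    // if the URL has no parsable host.
    private static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host != null ? host : url;
    }

    /** Milliseconds to wait before the URL's host may be queried again (0 if allowed now). */
    public long millisUntilAllowed(String url, long nowMillis) {
        Long last = lastAccess.get(hostOf(url));
        if (last == null) {
            return 0; // host has not been queried yet
        }
        long elapsed = nowMillis - last;
        return Math.max(0, minDelayMillis - elapsed);
    }

    /** Records that the URL's host was queried at the given time. */
    public void recordAccess(String url, long nowMillis) {
        lastAccess.put(hostOf(url), nowMillis);
    }
}
```

A crawler loop could call millisUntilAllowed before each download, skip or postpone URLs whose host is still within the delay window, and call recordAccess after each request. Timestamps are passed in explicitly here so that the behaviour is easy to test; a real crawler would typically pass System.currentTimeMillis().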