Package comirva.web.crawling.agmis

Class Summary
CrawlListCreator Creates a list of all URLs to be crawled from the info.xml files.
ExaleadRetriever This class automatically queries Exalead and stores the resulting URLs in text files.
GoldenRetriever This class retrieves the set of web pages listed in the URL list generated by the class CrawlListCreator.
GoldenRetriever_ProcessedIndexCorrector This class analyzes crawling.txt and writes the file processed_idx.txt, which contains the indices (with respect to crawling.txt) of all URLs that have actually been retrieved, determined by checking whether the corresponding files exist on the hard disk.
SearchResultsAnalyzer Analyzes the info.xml files stored for a crawl and prints out some statistical measures.
SubsetCollectionCreation_Linux Creates a subset of the complete artist collection's retrieved web pages, given a text file that lists the complete paths to the crawl directories of each artist to be included.
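
The index-correction step described for GoldenRetriever_ProcessedIndexCorrector can be sketched roughly as follows. This is not the package's actual code. It is a minimal illustration that assumes crawling.txt holds one URL per line and that the page for line i is stored as "<i>.html" in a single directory. The class name ProcessedIndexSketch, both method names, the 0-based indexing, and the file naming scheme are all hypothetical; the real on-disk layout may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of comirva.web.crawling.agmis.
class ProcessedIndexSketch {

    /**
     * Returns the 0-based line indices of the crawl list whose page file exists.
     * Assumption: the page for line i is stored as "<i>.html" inside pageDir.
     */
    static List<Integer> findRetrievedIndices(Path crawlList, Path pageDir) throws IOException {
        List<String> urls = Files.readAllLines(crawlList);
        List<Integer> retrieved = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (urls.get(i).trim().isEmpty()) {
                continue; // skip blank lines, but keep their index position
            }
            if (Files.isRegularFile(pageDir.resolve(i + ".html"))) {
                retrieved.add(i);
            }
        }
        return retrieved;
    }

    /** Writes one index per line, in the spirit of processed_idx.txt. */
    static void writeIndices(List<Integer> indices, Path out) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int idx : indices) {
            lines.add(Integer.toString(idx));
        }
        Files.write(out, lines);
    }
}
```

A caller would pass the path to crawling.txt and the directory holding the retrieved pages to findRetrievedIndices, then hand the result to writeIndices together with the path of the output file.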