comirva.web.crawling
Class AnySearch

java.lang.Object
  extended by comirva.web.crawling.AnySearch

public class AnySearch
extends Object

This class provides simple access to the results of search engines using Google-like parameters. It can be used directly to crawl the web by defining a search engine's URL and a query.


Field Summary
static int MAX_RETRIES
           
static int MAX_WAITTIME
           
static int RESULTS_TO_REQUEST
           
static int RETRY_INTERVAL
           
 
Constructor Summary
AnySearch(AnySearchConfig asCfg, String engineURL, String query)
          Creates a new AnySearch instance to crawl the web.
 
Method Summary
 int getPageCount()
          Returns the number of web pages the search engine returned for the query.
 URL[] getResultURLs(int maxNumber)
          Returns an array of the URLs that the query to the search engine yielded.
 boolean timedOut()
          Indicates whether the connection to retrieve the full page exceeded the time limit.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

RESULTS_TO_REQUEST

public static int RESULTS_TO_REQUEST

MAX_RETRIES

public static int MAX_RETRIES

RETRY_INTERVAL

public static int RETRY_INTERVAL

MAX_WAITTIME

public static int MAX_WAITTIME

Constructor Detail

AnySearch

public AnySearch(AnySearchConfig asCfg,
                 String engineURL,
                 String query)
          throws WebCrawlException
Creates a new AnySearch instance to crawl the web.

Parameters:
asCfg - an AnySearchConfig containing the configuration
engineURL - a String with the URL of the search engine to be used
query - a String specifying the exact search query
Throws:
WebCrawlException

Method Detail

getPageCount

public int getPageCount()
Returns the number of web pages the search engine returned for the query.

Returns:
the number of web pages found

getResultURLs

public URL[] getResultURLs(int maxNumber)
Returns an array of the URLs that the query to the search engine yielded.

Parameters:
maxNumber - the maximum number of URLs to return (if more than maxNumber URLs were found, only maxNumber are returned)
Returns:
a URL[] containing the URLs
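
The maxNumber contract can be illustrated with a small, self-contained sketch. It does not use AnySearch itself: the class ResultLimitSketch and its helpers limit and url are hypothetical names introduced only for this illustration. Keeping the leading entries of the array and treating a negative maxNumber as zero are both assumptions, since the documentation above does not say which URLs are kept or how negative values are handled.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Arrays;

final class ResultLimitSketch {

    // Hypothetical helper, not part of AnySearch: returns at most maxNumber
    // URLs, taken from the front of the array (an assumption of this sketch).
    static URL[] limit(URL[] found, int maxNumber) {
        // Clamp to [0, found.length]; handling of negative values is assumed.
        int n = Math.max(0, Math.min(maxNumber, found.length));
        return Arrays.copyOf(found, n);
    }

    // Converts a string to a URL, rethrowing the checked exception unchecked.
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL[] found = {
            url("http://example.org/a"),
            url("http://example.org/b"),
            url("http://example.org/c")
        };
        System.out.println(limit(found, 2).length);  // prints 2
        System.out.println(limit(found, 10).length); // prints 3
    }
}
```

With three URLs found, a maxNumber of 2 yields two URLs, while a maxNumber of 10 yields all three.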

timedOut

public boolean timedOut()
Indicates whether the connection to retrieve the full page exceeded the time limit.

Returns:
true if a timeout occurred, false otherwise
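
One possible way to combine the three methods is sketched below. So that it compiles without the CoMIRVA library, it uses a hypothetical stand-in interface, SearchResults, which only mirrors the method signatures documented above. The names AnySearchUsageSketch, collect and stub are likewise invented for this illustration. Checking timedOut() before reading the URLs is a choice made by this sketch, not a documented requirement.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Arrays;

final class AnySearchUsageSketch {

    // Returns up to maxNumber result URLs, or an empty array if the search
    // timed out. The order of the checks is this sketch's own choice.
    static URL[] collect(SearchResults results, int maxNumber) {
        if (results.timedOut()) {
            return new URL[0];
        }
        return results.getResultURLs(maxNumber);
    }

    // Canned stub standing in for a real search; it performs no network access.
    static SearchResults stub(boolean expired, String... urls) {
        URL[] all = new URL[urls.length];
        for (int i = 0; i < urls.length; i++) {
            try {
                all[i] = URI.create(urls[i]).toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return new SearchResults() {
            // The stub simply reports the number of canned URLs.
            public int getPageCount() { return all.length; }
            public URL[] getResultURLs(int maxNumber) {
                int n = Math.max(0, Math.min(maxNumber, all.length));
                return Arrays.copyOf(all, n);
            }
            public boolean timedOut() { return expired; }
        };
    }

    public static void main(String[] args) {
        SearchResults ok = stub(false, "http://example.org/a", "http://example.org/b");
        System.out.println(ok.getPageCount() + " pages, kept " + collect(ok, 1).length);
        SearchResults late = stub(true, "http://example.org/a");
        System.out.println("timed out, kept " + collect(late, 5).length);
    }
}

// Hypothetical stand-in mirroring the three documented AnySearch methods.
// With CoMIRVA on the classpath, the object would instead be created with
// new AnySearch(asCfg, engineURL, query), which may throw WebCrawlException.
interface SearchResults {
    int getPageCount();
    URL[] getResultURLs(int maxNumber);
    boolean timedOut();
}
```

Run as written, the sketch prints "2 pages, kept 1" followed by "timed out, kept 0".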