comirva.config
Class WebCrawlingConfig

java.lang.Object
  extended by comirva.config.WebCrawlingConfig
All Implemented Interfaces:
AnySearchConfig

public class WebCrawlingConfig
extends Object
implements AnySearchConfig

This class represents the configuration of a simple WebCrawler. The WebCrawler queries an arbitrary search engine for a list of URLs and then crawls them. An instance of this class is used to pass the configuration to the WebCrawler instance.


Constructor Summary
WebCrawlingConfig(String searchEngineURL, int numberOfRetries, int intervalBetweenRetries, int firstRequestedPageNumber, String additionalKeywords, boolean additionalKeywordsAfterSearchString, int numberOfPages, String pathStoreRetrievedPages, String pathExternalCrawler, boolean isStoreURLList, boolean isQuoteSearchTerms)
          Creates a new WebCrawlingConfig instance.
 
Method Summary
 String getAdditionalKeywords()
          Returns the additional keywords to be added to the search string.
 boolean getAdditionalKeywordsAfterSearchString()
          Returns whether additional keywords are to be placed after the search string or before.
 int getFirstRequestedPageNumber()
          Returns the number of the first requested page (usually 0).
 int getIntervalBetweenRetries()
          Returns the interval between two retries in case of failure (in seconds).
 int getNumberOfRequestedPages()
          Returns the number of pages that should be returned by the search engine and subsequently crawled.
 int getNumberOfRetries()
          Returns the number of retries if a search query fails.
 String getPathExternalCrawler()
          Returns the command needed to start the external crawler.
 String getPathStoreRetrievedPages()
          Returns the root directory where all retrieved web pages are to be stored.
 String getSearchEngineURL()
          Returns the URL of the search engine to be used.
 boolean isQuoteSearchTerms()
          Returns whether the search terms should be quoted automatically (i.e., whether phrase search is used).
 boolean isStoreURLList()
          Returns whether a list of all crawled URLs should be stored for every query term.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WebCrawlingConfig

public WebCrawlingConfig(String searchEngineURL,
                         int numberOfRetries,
                         int intervalBetweenRetries,
                         int firstRequestedPageNumber,
                         String additionalKeywords,
                         boolean additionalKeywordsAfterSearchString,
                         int numberOfPages,
                         String pathStoreRetrievedPages,
                         String pathExternalCrawler,
                         boolean isStoreURLList,
                         boolean isQuoteSearchTerms)
Creates a new WebCrawlingConfig instance.

Parameters:
searchEngineURL - a String containing the URL of the search engine
numberOfRetries - the number of retries in case of failure
intervalBetweenRetries - the interval between two retries (in seconds)
firstRequestedPageNumber - the number (index) of the first requested page
additionalKeywords - additional keywords in the query
additionalKeywordsAfterSearchString - whether the additional keywords are to be placed after (or before) the search string
numberOfPages - number of pages to retrieve
pathStoreRetrievedPages - local path where the retrieved HTML documents should be stored
pathExternalCrawler - command to run the external crawler (wget)
isStoreURLList - flag to determine whether a list of all retrieved URLs should be stored for every query term
isQuoteSearchTerms - whether the search terms should be quoted automatically (phrase search)
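
The constructor takes eleven positional arguments, several of them of the same type, so two values are easily swapped by accident. The sketch below labels each argument with its parameter name. CrawlConfigSketch is a hypothetical stand-in that only mirrors the documented signature, so that the snippet compiles without CoMIRVA on the classpath; the URL, paths, and numeric values are placeholders, not recommended settings.

```java
/** Hypothetical stand-in mirroring the documented constructor of WebCrawlingConfig. */
final class CrawlConfigSketch {
    final String searchEngineURL;
    final int numberOfRetries;
    final int intervalBetweenRetries;
    final int firstRequestedPageNumber;
    final String additionalKeywords;
    final boolean additionalKeywordsAfterSearchString;
    final int numberOfPages;
    final String pathStoreRetrievedPages;
    final String pathExternalCrawler;
    final boolean isStoreURLList;
    final boolean isQuoteSearchTerms;

    CrawlConfigSketch(String searchEngineURL,
                      int numberOfRetries,
                      int intervalBetweenRetries,
                      int firstRequestedPageNumber,
                      String additionalKeywords,
                      boolean additionalKeywordsAfterSearchString,
                      int numberOfPages,
                      String pathStoreRetrievedPages,
                      String pathExternalCrawler,
                      boolean isStoreURLList,
                      boolean isQuoteSearchTerms) {
        this.searchEngineURL = searchEngineURL;
        this.numberOfRetries = numberOfRetries;
        this.intervalBetweenRetries = intervalBetweenRetries;
        this.firstRequestedPageNumber = firstRequestedPageNumber;
        this.additionalKeywords = additionalKeywords;
        this.additionalKeywordsAfterSearchString = additionalKeywordsAfterSearchString;
        this.numberOfPages = numberOfPages;
        this.pathStoreRetrievedPages = pathStoreRetrievedPages;
        this.pathExternalCrawler = pathExternalCrawler;
        this.isStoreURLList = isStoreURLList;
        this.isQuoteSearchTerms = isQuoteSearchTerms;
    }
}

final class ConstructorExample {
    /** Builds a configuration with placeholder values, one labeled argument per line. */
    static CrawlConfigSketch exampleConfig() {
        return new CrawlConfigSketch(
                "http://www.example.com/search?q=", // searchEngineURL (placeholder)
                3,                                  // numberOfRetries
                10,                                 // intervalBetweenRetries, in seconds
                0,                                  // firstRequestedPageNumber
                "music review",                     // additionalKeywords
                true,                               // additionalKeywordsAfterSearchString
                50,                                 // numberOfPages
                "/tmp/crawl",                       // pathStoreRetrievedPages (placeholder)
                "wget",                             // pathExternalCrawler
                true,                               // isStoreURLList
                true);                              // isQuoteSearchTerms
    }
}
```

With the library available, the same argument order applies to new WebCrawlingConfig(...) from comirva.config.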

Method Detail

getSearchEngineURL

public String getSearchEngineURL()
Returns the URL of the search engine to be used.

Returns:
a String containing the URL of the search engine.

getNumberOfRetries

public int getNumberOfRetries()
Returns the number of retries if a search query fails.

Specified by:
getNumberOfRetries in interface AnySearchConfig
Returns:
the number of retries

getIntervalBetweenRetries

public int getIntervalBetweenRetries()
Returns the interval between two retries in case of failure (in seconds).

Specified by:
getIntervalBetweenRetries in interface AnySearchConfig
Returns:
the interval between two retries
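
How the two retry settings might be consumed can be sketched as a simple retry loop. This is not the WebCrawler's actual code, which this page does not show. The sketch assumes one initial attempt followed by up to numberOfRetries further attempts, and it treats the interval as seconds, as documented. RetrySketch and runWithRetries are hypothetical names.

```java
import java.util.function.Supplier;

final class RetrySketch {
    /**
     * Runs the query once and, on failure, up to numberOfRetries more times,
     * waiting intervalSeconds between attempts. Rethrows the last failure
     * if every attempt fails.
     */
    static <T> T runWithRetries(Supplier<T> query, int numberOfRetries, int intervalSeconds) {
        if (numberOfRetries < 0 || intervalSeconds < 0) {
            throw new IllegalArgumentException("retries and interval must be non-negative");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt <= numberOfRetries; attempt++) {
            try {
                return query.get();
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and possibly retry
            }
            if (attempt < numberOfRetries) {
                try {
                    Thread.sleep(intervalSeconds * 1000L); // seconds to milliseconds
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw lastFailure;
    }
}
```

A caller would pass getNumberOfRetries() and getIntervalBetweenRetries() as the second and third arguments.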

getFirstRequestedPageNumber

public int getFirstRequestedPageNumber()
Returns the number of the first requested page (usually 0). Setting this value is necessary if the search engine returns no more than a fixed number of results per query. Google, for example, limits this number to 100.

Specified by:
getFirstRequestedPageNumber in interface AnySearchConfig
Returns:
the number of the first requested page
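
One way to work around a per-query result limit is to split a large request into several queries, each with its own first page number. The sketch below computes such batches. It assumes that the first page number acts as a zero-based offset into the result list, which this page does not state explicitly; PagingSketch, batches, and maxPerQuery are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class PagingSketch {
    /**
     * Splits totalPages into batches of at most maxPerQuery pages.
     * Each element is {firstRequestedPageNumber, numberOfPages} for one query.
     */
    static List<int[]> batches(int totalPages, int maxPerQuery) {
        if (totalPages < 0 || maxPerQuery <= 0) {
            throw new IllegalArgumentException("totalPages must be >= 0 and maxPerQuery > 0");
        }
        List<int[]> result = new ArrayList<>();
        for (int first = 0; first < totalPages; first += maxPerQuery) {
            int count = Math.min(maxPerQuery, totalPages - first);
            result.add(new int[] {first, count});
        }
        return result;
    }
}
```

For 250 pages and a limit of 100 per query, this yields the pairs (0, 100), (100, 100), and (200, 50), each of which could be passed as firstRequestedPageNumber and numberOfPages to a separate configuration.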

getAdditionalKeywords

public String getAdditionalKeywords()
Returns the additional keywords to be added to the search string.

Returns:
a String containing the additional keywords

getAdditionalKeywordsAfterSearchString

public boolean getAdditionalKeywordsAfterSearchString()
Returns whether additional keywords are to be placed after the search string or before.

Returns:
true if additional keywords should be placed after the search string, false if they are placed before the search string
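
The effect of this flag can be illustrated with a small helper that combines a search string with the additional keywords. It is a sketch under the assumption that the two parts are joined by a single space; the actual assembly inside the WebCrawler is not documented here. QueryAssemblySketch and buildQuery are hypothetical names.

```java
final class QueryAssemblySketch {
    /**
     * Places the additional keywords after the search string when
     * keywordsAfterSearchString is true, and before it otherwise.
     */
    static String buildQuery(String searchString,
                             String additionalKeywords,
                             boolean keywordsAfterSearchString) {
        if (additionalKeywords == null || additionalKeywords.isEmpty()) {
            return searchString; // nothing to add
        }
        return keywordsAfterSearchString
                ? searchString + " " + additionalKeywords
                : additionalKeywords + " " + searchString;
    }
}
```

For the search string "Miles Davis" and the keywords "music review", true gives "Miles Davis music review" and false gives "music review Miles Davis".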

getNumberOfRequestedPages

public int getNumberOfRequestedPages()
Returns the number of pages that should be returned by the search engine and subsequently crawled.

Specified by:
getNumberOfRequestedPages in interface AnySearchConfig
Returns:
the number of web pages

getPathStoreRetrievedPages

public String getPathStoreRetrievedPages()
Returns the root directory where all retrieved web pages are to be stored.

Returns:
a String containing the path where the retrieved pages should be stored.

getPathExternalCrawler

public String getPathExternalCrawler()
Returns the command needed to start the external crawler.

Returns:
a String containing the command that starts the external crawler.
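
A possible way to turn the returned command into a process invocation is sketched below. It assumes that the command is a bare wget executable without arguments and that pages go into a target directory via wget's --directory-prefix option; how the WebCrawler really invokes the external crawler is not shown on this page. ExternalCrawlerSketch and buildCommand are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ExternalCrawlerSketch {
    /**
     * Builds an argument list suitable for java.lang.ProcessBuilder:
     * the crawler executable, the download directory, and the URL to fetch.
     */
    static List<String> buildCommand(String pathExternalCrawler,
                                     String targetDirectory,
                                     String url) {
        List<String> command = new ArrayList<>();
        command.add(pathExternalCrawler);
        command.add("--directory-prefix=" + targetDirectory); // wget: save files under this directory
        command.add(url);
        return command;
    }
}
```

The resulting list could be handed to new ProcessBuilder(command).start(). Passing the arguments as a list rather than as one concatenated string avoids quoting problems with paths that contain spaces.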

isStoreURLList

public boolean isStoreURLList()
Returns whether a list of all crawled URLs should be stored for every query term.

Returns:
true if a text file containing all crawled URLs is to be stored for every query term; false if information about crawled URLs is to be discarded

isQuoteSearchTerms

public boolean isQuoteSearchTerms()
Returns whether the search terms should be quoted automatically (i.e., whether phrase search is used).

Returns:
true if all search terms are automatically quoted; false if the search is conducted using the search terms as they are
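
What automatic quoting could look like in a request URL is sketched below. The sketch assumes that the search engine URL ends in a query parameter to which the URL-encoded search terms can simply be appended; that convention, like the names PhraseQuerySketch, quoteIfRequested, and requestUrl, is hypothetical and not taken from the WebCrawler.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PhraseQuerySketch {
    /** Wraps the search terms in double quotes when phrase search is requested. */
    static String quoteIfRequested(String searchTerms, boolean quoteSearchTerms) {
        return quoteSearchTerms ? "\"" + searchTerms + "\"" : searchTerms;
    }

    /** Appends the (optionally quoted) and URL-encoded search terms to the search engine URL. */
    static String requestUrl(String searchEngineURL, String searchTerms, boolean quoteSearchTerms) {
        String query = quoteIfRequested(searchTerms, quoteSearchTerms);
        return searchEngineURL + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

With the placeholder URL http://www.example.com/search?q= and the terms Miles Davis, quoting produces a query part of %22Miles+Davis%22, while the unquoted form is Miles+Davis.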