comirva.web.crawling.agmis
Class GoldenRetriever

java.lang.Object
  extended by java.lang.Thread
      extended by comirva.web.crawling.agmis.GoldenRetriever
All Implemented Interfaces:
Runnable

public class GoldenRetriever
extends Thread

This class retrieves the set of web pages listed by the class CrawlListCreator. Pages are fetched with the external tool wget, and load-balancing strategies are implemented to avoid continuously retrieving pages from the same web site.
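
The load-balancing idea can be illustrated with a minimal sketch. It assumes a fixed minimum wait between two requests to the same host, in the spirit of WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST. The class HostThrottle and its method tryAcquire are hypothetical names chosen for illustration; they are not part of comirva and may differ from what GoldenRetriever actually does.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of comirva): remembers when each host
// was last contacted and refuses requests that come too early.
class HostThrottle {
    private final long waitMillis;
    private final Map<String, Long> lastAccess = new HashMap<>();

    HostThrottle(long waitMillis) {
        this.waitMillis = waitMillis;
    }

    // Returns true and records the access if the host of the given URL
    // may be contacted at time nowMillis; returns false otherwise.
    synchronized boolean tryAcquire(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // no host component, nothing to throttle on
        }
        Long last = lastAccess.get(host);
        if (last != null && nowMillis - last < waitMillis) {
            return false; // same host was contacted too recently
        }
        lastAccess.put(host, nowMillis);
        return true;
    }
}
```

A download thread would call tryAcquire with System.currentTimeMillis() and move on to another URL in the crawl list when it returns false.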


Nested Class Summary
 
Nested classes/interfaces inherited from class java.lang.Thread
Thread.State, Thread.UncaughtExceptionHandler
 
Field Summary
static Vector<URL> blacklistSites
           
static boolean CHECK_FOR_DIR_STRUCTURE
           
static Vector<Integer> currentlyDownloadingIdx
           
static int END_OFFSET
           
static int EQUAL_HOST_LEVELS
           
static DownloadControlDataVector hosts
           
static int MAX_PARALLEL_DOWNLOADS
           
static int MAX_SKIP_URLS
           
static int MAX_URLS_IN_CRAWL_LIST
           
static int[] PAGE_NO_RANGE
           
static File PROCESSED_IDX_FILE
           
static TreeSet<Integer> retrievedIdx
           
static Vector<RetrievalData> ri
           
static File ROOT_DIR
           
static int START_OFFSET
           
static File URL_FILE
           
static boolean USE_PAGE_NO_RANGE_FILTER
           
static long WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST
           
static File WGET
           
 
Fields inherited from class java.lang.Thread
MAX_PRIORITY, MIN_PRIORITY, NORM_PRIORITY
 
Constructor Summary
GoldenRetriever(int threadNo)
          Constructs a new instance of the GoldenRetriever.
 
Method Summary
static void createDirectoryStructure(File f)
          Creates all directories on the path from the directory f up to the root dir if they do not already exist.
static int doWaitFor(Process p)
          Waits for a process to finish and returns its exit value.
protected static void fillCrawlList()
          Inserts the first or next set of URLs to be fetched into the crawl list.
protected  void finalize()
          Ensures that all wget processes are terminated when the program exits.
static void main(String[] args)
           
 void run()
          Starts the retrieval process.
 
Methods inherited from class java.lang.Thread
activeCount, checkAccess, countStackFrames, currentThread, destroy, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, stop, suspend, toString, yield
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ROOT_DIR

public static final File ROOT_DIR

URL_FILE

public static final File URL_FILE

PROCESSED_IDX_FILE

public static final File PROCESSED_IDX_FILE

WGET

public static final File WGET

CHECK_FOR_DIR_STRUCTURE

public static final boolean CHECK_FOR_DIR_STRUCTURE
See Also:
Constant Field Values

START_OFFSET

public static int START_OFFSET

END_OFFSET

public static int END_OFFSET

MAX_SKIP_URLS

public static final int MAX_SKIP_URLS
See Also:
Constant Field Values

PAGE_NO_RANGE

public static final int[] PAGE_NO_RANGE

USE_PAGE_NO_RANGE_FILTER

public static final boolean USE_PAGE_NO_RANGE_FILTER
See Also:
Constant Field Values

MAX_PARALLEL_DOWNLOADS

public static final int MAX_PARALLEL_DOWNLOADS
See Also:
Constant Field Values

WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST

public static final long WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST
See Also:
Constant Field Values

EQUAL_HOST_LEVELS

public static final int EQUAL_HOST_LEVELS
See Also:
Constant Field Values

MAX_URLS_IN_CRAWL_LIST

public static final int MAX_URLS_IN_CRAWL_LIST
See Also:
Constant Field Values

ri

public static Vector<RetrievalData> ri

retrievedIdx

public static TreeSet<Integer> retrievedIdx

hosts

public static DownloadControlDataVector hosts

currentlyDownloadingIdx

public static Vector<Integer> currentlyDownloadingIdx

blacklistSites

public static Vector<URL> blacklistSites
Constructor Detail

GoldenRetriever

public GoldenRetriever(int threadNo)
Constructs a new instance of the GoldenRetriever.

Method Detail

doWaitFor

public static int doWaitFor(Process p)
Waits for a process to finish and returns its exit value. This is a workaround for cases in which process.waitFor() never returns.
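
One well-known reason for a wait on a child process never returning is that the child blocks on a full output pipe that nobody reads. The sketch below shows one way to work around this: poll exitValue() while draining both output streams. It is an assumption about the kind of workaround meant here, not the actual body of doWaitFor; the class ProcessWaiter and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of comirva).
class ProcessWaiter {
    // Polls the process until it has exited and returns its exit value.
    // Pending output is discarded so the child cannot stall on a full pipe.
    static int waitAndDrain(Process p) {
        byte[] buf = new byte[4096];
        try {
            InputStream out = p.getInputStream();
            InputStream err = p.getErrorStream();
            while (true) {
                while (out.available() > 0) { out.read(buf); }
                while (err.available() > 0) { err.read(buf); }
                try {
                    return p.exitValue(); // throws while the process is still running
                } catch (IllegalThreadStateException stillRunning) {
                    Thread.sleep(50); // short pause before polling again
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // Convenience wrapper: starts the command and waits for it.
    static int runAndWait(String... command) {
        try {
            return waitAndDrain(new ProcessBuilder(command).start());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When the output is not needed at all, redirecting it with ProcessBuilder (for example via redirectErrorStream and redirectOutput) avoids the polling loop.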


run

public void run()
Starts the retrieval process.

Specified by:
run in interface Runnable
Overrides:
run in class Thread

fillCrawlList

protected static void fillCrawlList()
Inserts the first or next set of URLs to be fetched into the crawl list.
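
A batch-wise refill of this kind can be sketched as follows. The sketch assumes that URLs are addressed by their index in the URL file, that indices already retrieved (compare retrievedIdx) are skipped, and that the list is capped (compare MAX_URLS_IN_CRAWL_LIST). The class CrawlListFiller, its method fill, and all parameter names are hypothetical; the real method takes no arguments and works on the static fields.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of comirva).
class CrawlListFiller {
    // Appends indices of not-yet-retrieved URLs to crawlList, starting at
    // position start, until the list holds maxInList entries or the URLs
    // run out. Returns the position at which the next refill should resume.
    static int fill(List<String> allUrls, int start, int maxInList,
                    Set<Integer> retrieved, List<Integer> crawlList) {
        int i = start;
        while (i < allUrls.size() && crawlList.size() < maxInList) {
            if (!retrieved.contains(i)) {
                crawlList.add(i);
            }
            i++;
        }
        return i;
    }
}
```

Calling fill once with start 0 gives the first set of URLs; calling it again with the returned position gives the next set.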


createDirectoryStructure

public static void createDirectoryStructure(File f)
Creates all directories on the path from the directory f up to the root dir if they do not already exist.

Parameters:
f - the directory whose path is to be created
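
A comparable effect can be obtained with File.mkdirs(), which creates a directory together with any missing parent directories. The following is a minimal sketch under that assumption and not the actual implementation of createDirectoryStructure; DirectoryHelper and ensureDirectories are hypothetical names.

```java
import java.io.File;

// Hypothetical helper (not part of comirva).
class DirectoryHelper {
    // Returns true if dir exists as a directory after the call.
    // mkdirs() also creates missing parent directories.
    static boolean ensureDirectories(File dir) {
        return dir.isDirectory() || dir.mkdirs();
    }
}
```

Checking isDirectory() first makes the call safe to repeat, because mkdirs() returns false when the directory already exists.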

main

public static void main(String[] args)

finalize

protected void finalize()
Ensures that all wget processes are terminated when the program exits.

Overrides:
finalize in class Object
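
Finalizers are not guaranteed to run, and Object.finalize() has been deprecated since Java 9. A different technique for the same goal is a JVM shutdown hook. The sketch below shows that alternative, not the code of this class; ProcessRegistry and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of comirva): keeps track of started
// child processes and destroys them when the JVM shuts down.
class ProcessRegistry {
    private static final List<Process> PROCESSES = new ArrayList<>();

    static {
        // The hook runs on normal exit and on most termination signals.
        Runtime.getRuntime().addShutdownHook(new Thread(ProcessRegistry::destroyAll));
    }

    // Remembers a child process so that it can be destroyed later.
    static synchronized void register(Process p) {
        PROCESSES.add(p);
    }

    // Destroys every registered process; destroying a process that has
    // already exited has no effect.
    static synchronized void destroyAll() {
        for (Process p : PROCESSES) {
            p.destroy();
        }
        PROCESSES.clear();
    }
}
```

Each thread would call ProcessRegistry.register(process) right after starting wget.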