java.lang.Object java.lang.Thread comirva.web.crawling.agmis.GoldenRetriever
public class GoldenRetriever extends Thread
This class retrieves the set of web pages listed by the class CrawlListCreator. It uses the external tool wget and implements some load-balancing strategies, e.g. to avoid continuously retrieving pages from the same web site.
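The load balancing mentioned above is not spelled out on this page. As a minimal sketch of one possible strategy, assuming a fixed minimum delay between retrievals from the same host (analogous to, but not taken from, WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST), the hypothetical HostThrottle class below refuses a retrieval when the same host was contacted too recently. The class name, the method name, and the use of a host string as key are illustrative assumptions, not part of comirva.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of comirva): remembers the last retrieval time per host. */
class HostThrottle {
    /** Minimum number of milliseconds between two retrievals from the same host. */
    private final long waitMillis;
    /** Host name mapped to the time of the last granted retrieval. */
    private final Map<String, Long> lastRetrieval = new HashMap<>();

    HostThrottle(long waitMillis) {
        this.waitMillis = waitMillis;
    }

    /**
     * Returns true and records the access if the host may be contacted at nowMillis;
     * returns false if the previous retrieval from this host was too recent.
     */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        Long last = lastRetrieval.get(host);
        if (last != null && nowMillis - last < waitMillis) {
            return false; // too soon, the caller should pick a URL from another host
        }
        lastRetrieval.put(host, nowMillis);
        return true;
    }
}
```

A download thread could call `tryAcquire(url.getHost(), System.currentTimeMillis())` and move on to the next URL in the crawl list whenever it returns false.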
Nested Class Summary |
---|
Nested classes/interfaces inherited from class java.lang.Thread |
---|
Thread.State, Thread.UncaughtExceptionHandler |
Field Summary | |
---|---|
static Vector<URL> | blacklistSites |
static boolean | CHECK_FOR_DIR_STRUCTURE |
static Vector<Integer> | currentlyDownloadingIdx |
static int | END_OFFSET |
static int | EQUAL_HOST_LEVELS |
static DownloadControlDataVector | hosts |
static int | MAX_PARALLEL_DOWNLOADS |
static int | MAX_SKIP_URLS |
static int | MAX_URLS_IN_CRAWL_LIST |
static int[] | PAGE_NO_RANGE |
static File | PROCESSED_IDX_FILE |
static TreeSet<Integer> | retrievedIdx |
static Vector<RetrievalData> | ri |
static File | ROOT_DIR |
static int | START_OFFSET |
static File | URL_FILE |
static boolean | USE_PAGE_NO_RANGE_FILTER |
static long | WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST |
static File | WGET |
Fields inherited from class java.lang.Thread |
---|
MAX_PRIORITY, MIN_PRIORITY, NORM_PRIORITY |
Constructor Summary | |
---|---|
GoldenRetriever(int threadNo) Constructs a new instance of the GoldenRetriever. |
Method Summary | |
---|---|
static void | createDirectoryStructure(File f) Creates all directories on the path from the root dir to the directory f if they do not already exist. |
static int | doWaitFor(Process p) Method to perform a "wait" for a process and return its exit value. |
protected static void | fillCrawlList() Inserts the first or next set of URLs to be fetched into the crawl list. |
protected void | finalize() Ensures that all wget processes are terminated when the program exits. |
static void | main(String[] args) |
void | run() Starts the retrieval process. |
Methods inherited from class java.lang.Thread |
---|
activeCount, checkAccess, countStackFrames, currentThread, destroy, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, stop, suspend, toString, yield |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final File ROOT_DIR
public static final File URL_FILE
public static final File PROCESSED_IDX_FILE
public static final File WGET
public static final boolean CHECK_FOR_DIR_STRUCTURE
public static int START_OFFSET
public static int END_OFFSET
public static final int MAX_SKIP_URLS
public static final int[] PAGE_NO_RANGE
public static final boolean USE_PAGE_NO_RANGE_FILTER
public static final int MAX_PARALLEL_DOWNLOADS
public static final long WAIT_BETWEEN_RETRIEVALS_FROM_SAME_HOST
public static final int EQUAL_HOST_LEVELS
public static final int MAX_URLS_IN_CRAWL_LIST
public static Vector<RetrievalData> ri
public static TreeSet<Integer> retrievedIdx
public static DownloadControlDataVector hosts
public static Vector<Integer> currentlyDownloadingIdx
public static Vector<URL> blacklistSites
Constructor Detail |
---|
public GoldenRetriever(int threadNo)
Method Detail |
---|
public static int doWaitFor(Process p)
Method to perform a "wait" for a process and return its exit value. This is a workaround for process.waitFor() never returning.
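The body of doWaitFor is not shown on this page. A common reason for process.waitFor() never returning is that the child blocks once its output pipes fill up, so one way such a workaround might look is to drain both streams while polling for the exit value. The following is a sketch under that assumption; the class ProcessWaiter and the method waitAndDrain are hypothetical names, not the comirva implementation.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of comirva): waits for a process without blocking on full pipes. */
class ProcessWaiter {

    /** Polls the process until it has exited, discarding its output, and returns the exit value. */
    static int waitAndDrain(Process p) throws IOException, InterruptedException {
        InputStream out = p.getInputStream();
        InputStream err = p.getErrorStream();
        byte[] buffer = new byte[4096];
        while (true) {
            // Empty both pipes so the child can never stall on a full buffer.
            drain(out, buffer);
            drain(err, buffer);
            try {
                return p.exitValue(); // throws while the process is still running
            } catch (IllegalThreadStateException stillRunning) {
                Thread.sleep(50); // poll again shortly
            }
        }
    }

    /** Reads and discards whatever is currently available without blocking. */
    private static void drain(InputStream in, byte[] buffer) throws IOException {
        while (in.available() > 0) {
            if (in.read(buffer) < 0) {
                break; // end of stream
            }
        }
    }
}
```

Redirecting the child's output with ProcessBuilder (for example `redirectErrorStream(true)` plus a reader thread) is an alternative design that avoids polling altogether.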
public void run()
Specified by: run in interface Runnable
Overrides: run in class Thread
protected static void fillCrawlList()
public static void createDirectoryStructure(File f)
Parameters: f
public static void main(String[] args)
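For createDirectoryStructure(File f), the page only states that missing directories up to the root dir are created. A minimal sketch of that behaviour, assuming java.io.File.mkdirs() is sufficient (it creates any missing parent directories), could look as follows. DirectoryHelper and ensureDirectories are hypothetical names used for illustration only.

```java
import java.io.File;

/** Hypothetical helper (not part of comirva): creates a directory and its missing parents. */
class DirectoryHelper {

    /** Returns true if dir exists as a directory afterwards, false if it could not be created. */
    static boolean ensureDirectories(File dir) {
        // isDirectory() covers the "already exists" case; mkdirs() creates the missing levels.
        return dir.isDirectory() || dir.mkdirs();
    }
}
```

Checking isDirectory() first matters because mkdirs() returns false when the directory already exists.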
protected void finalize()
Overrides: finalize in class Object
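Note that Object.finalize() is deprecated since Java 9 and is not guaranteed to run when the JVM exits. As a hedged alternative for terminating external processes such as wget at shutdown, the sketch below keeps a registry of started processes and destroys the live ones from a JVM shutdown hook. ProcessRegistry and all of its methods are hypothetical and do not appear in comirva.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of comirva): tracks child processes and kills them at exit. */
class ProcessRegistry {
    /** Thread-safe set of processes started by the download threads. */
    private static final Set<Process> RUNNING = ConcurrentHashMap.newKeySet();

    static void register(Process p) {
        RUNNING.add(p);
    }

    static void unregister(Process p) {
        RUNNING.remove(p);
    }

    /** Destroys every registered process that is still alive and returns how many were destroyed. */
    static int destroyAll() {
        int destroyed = 0;
        for (Process p : RUNNING) {
            if (p.isAlive()) {
                p.destroy();
                destroyed++;
            }
        }
        RUNNING.clear();
        return destroyed;
    }

    /** Registers a JVM shutdown hook that calls destroyAll(). */
    static void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(ProcessRegistry::destroyAll));
    }
}
```

A download thread would call `register` right after starting a process and `unregister` once it has finished; `installShutdownHook` would be called once, for instance at the start of main.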