comirva.data
Class SingleTermList

java.lang.Object
  extended by comirva.data.SingleTermList
All Implemented Interfaces:
XMLSerializable, Serializable

public class SingleTermList
extends Object
implements XMLSerializable, Serializable

This class implements a term list for a single text document. It is intended for text-mining purposes, and its design is geared mainly toward HTML files.

See Also:
Serialized Form

Field Summary
(package private)  Vector<String> audioContent
           
(package private)  String crawlDetails
           
(package private)  Vector<String> extAudio
           
(package private)  Vector<String> extImage
           
(package private)  Vector<String> extVideo
           
(package private)  File fileLocal
           
(package private)  Hashtable<String,Integer> frequency
           
(package private)  Vector<String> imageContent
           
(package private)  String searchTerm
           
(package private)  String urlSource
           
(package private)  Vector<String> videoContent
           
 
Constructor Summary
SingleTermList()
          Creates a new, empty SingleTermList instance.
SingleTermList(File fileLocal)
          Creates a new SingleTermList instance for the given local file.
 
Method Summary
 void calculateOccurrences(Vector<String> termList)
          Counts how often each term in the given Vector occurs in the text document and stores the resulting frequencies.
 Vector<String> extractLinks(String htmlLine, String searchAttribute)
          Tries to extract all links from the given attribute (in an arbitrary tag) that occur somewhere in the passed htmlLine.
 Vector<String> extractLinks(String htmlLine, String searchTag, String searchAttribute)
          Tries to extract all links from the given attribute in the given tag that occur somewhere in the passed htmlLine.
 Vector<String> extractLinks(String htmlLine, Vector<String> hrefs, String searchAttr)
          Tries to extract all links from the given attribute (in an arbitrary tag) that occur somewhere in the passed htmlLine.
 Vector<String> extractLinks(String htmlLine, Vector<String> hrefs, String searchTag, String searchAttr)
          Tries to extract all links from the given attribute in the given tag that occur somewhere in the passed htmlLine.
 Vector<String> getAudioContent()
           
 String getCrawlDetails()
           
 File getFileLocal()
           
 Hashtable<String,Integer> getFrequency()
           
 Vector<String> getImageContent()
           
 String getSearchTerm()
           
 String getUrlSource()
           
 Vector<String> getVideoContent()
           
 void printTFs()
          Prints a list of the term frequencies.
 void readXML(XMLStreamReader reader)
          Deserializes a SingleTermList instance from an XML file.
 void setCrawlDetails(String crawlDetails)
           
 void setExtAudio(Vector<String> extAudio)
           
 void setExtImage(Vector<String> extImage)
           
 void setExtVideo(Vector<String> extVideo)
           
 void setFileLocal(File fileLocal)
           
 void setSearchTerm(String searchTerm)
           
 void setUrlSource(String urlSource)
           
 void writeXML(XMLStreamWriter writer)
          Serializes a SingleTermList instance to an XML file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

frequency

Hashtable<String,Integer> frequency

fileLocal

File fileLocal

urlSource

String urlSource

searchTerm

String searchTerm

crawlDetails

String crawlDetails

audioContent

Vector<String> audioContent

imageContent

Vector<String> imageContent

videoContent

Vector<String> videoContent

extAudio

Vector<String> extAudio

extImage

Vector<String> extImage

extVideo

Vector<String> extVideo
Constructor Detail

SingleTermList

public SingleTermList(File fileLocal)
Creates a new SingleTermList instance for the given local file.

Parameters:
fileLocal - the file for which the term list should be created/loaded

SingleTermList

public SingleTermList()
Creates a new, empty SingleTermList instance.

Method Detail

calculateOccurrences

public void calculateOccurrences(Vector<String> termList)
Counts how often each term in the given Vector occurs in the text document and stores the resulting frequencies.

Parameters:
termList - a Vector containing a term list
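
The counting described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the CoMIRVA implementation: the class name TermFrequencySketch, the method countOccurrences, the explicit text argument, and the tokenization rule (split on anything that is not a letter or digit, compare case-insensitively) are all assumptions made for illustration.

```java
import java.util.Hashtable;
import java.util.Locale;
import java.util.Vector;

// Hypothetical helper; not part of comirva.data.
class TermFrequencySketch {

    // Counts how often each term of termList occurs in text.
    // Assumption: tokens are separated by non-letter, non-digit characters
    // and terms are matched case-insensitively.
    static Hashtable<String, Integer> countOccurrences(String text, Vector<String> termList) {
        Hashtable<String, Integer> frequency = new Hashtable<String, Integer>();
        // Start every requested term at zero so that absent terms are reported too.
        for (String term : termList) {
            frequency.put(term.toLowerCase(Locale.ROOT), 0);
        }
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            Integer count = frequency.get(token);
            if (count != null) {
                // Only tokens that appear in the term list are counted.
                frequency.put(token, count + 1);
            }
        }
        return frequency;
    }
}
```

In this sketch a term that never occurs keeps the value 0; whether the real method stores zero counts or omits such terms is not stated in this documentation.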

writeXML

public void writeXML(XMLStreamWriter writer)
Serializes a SingleTermList instance to an XML file.

Specified by:
writeXML in interface XMLSerializable
Parameters:
writer - an XMLStreamWriter that points to the XML file
See Also:
XMLSerializable.writeXML(javax.xml.stream.XMLStreamWriter)
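
The following sketch shows how term frequencies could be written through a javax.xml.stream.XMLStreamWriter. The element name SingleTermList, the element name Term, and the attribute name frequency are hypothetical; the XML layout actually produced by writeXML is not documented here. The class TermListXmlWriterSketch and its methods are likewise assumptions.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper; the element and attribute names are NOT the CoMIRVA schema.
class TermListXmlWriterSketch {

    // Writes one <Term frequency="n">term</Term> element per map entry
    // using a writer supplied by the caller.
    static void writeTerms(XMLStreamWriter writer, Map<String, Integer> frequency)
            throws XMLStreamException {
        writer.writeStartElement("SingleTermList");
        for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
            writer.writeStartElement("Term");
            writer.writeAttribute("frequency", String.valueOf(entry.getValue()));
            writer.writeCharacters(entry.getKey());
            writer.writeEndElement();
        }
        writer.writeEndElement();
    }

    // Convenience wrapper that serializes into a String instead of a file.
    static String toXml(Map<String, Integer> frequency) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writeTerms(writer, frequency);
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
        return out.toString();
    }
}
```

Passing the writer in, as writeXML does, lets the caller decide where the stream ends up (a file, a socket, or a string buffer as in toXml).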

readXML

public void readXML(XMLStreamReader reader)
Deserializes a SingleTermList instance from an XML file.

Specified by:
readXML in interface XMLSerializable
Parameters:
reader - an XMLStreamReader that points to the XML file
See Also:
XMLSerializable.readXML(javax.xml.stream.XMLStreamReader)
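
As a rough illustration of pull-parsing with a javax.xml.stream.XMLStreamReader, the sketch below rebuilds a frequency table from XML. It assumes a hypothetical layout in which each term is stored as a Term element with a frequency attribute; the format actually expected by readXML is not documented here, and the class TermListXmlReaderSketch is an assumption.

```java
import java.io.StringReader;
import java.util.Hashtable;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; the element and attribute names are NOT the CoMIRVA schema.
class TermListXmlReaderSketch {

    // Pulls events from the reader and collects every
    // <Term frequency="n">term</Term> element into a table.
    static Hashtable<String, Integer> readTerms(XMLStreamReader reader)
            throws XMLStreamException {
        Hashtable<String, Integer> frequency = new Hashtable<String, Integer>();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "Term".equals(reader.getLocalName())) {
                // Attributes must be read before getElementText(),
                // which advances the reader to the matching end tag.
                int count = Integer.parseInt(reader.getAttributeValue(null, "frequency"));
                String term = reader.getElementText();
                frequency.put(term, count);
            }
        }
        return frequency;
    }

    // Convenience wrapper that parses from a String instead of a file.
    static Hashtable<String, Integer> fromXml(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                return readTerms(reader);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML deserialization failed", e);
        }
    }
}
```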

printTFs

public void printTFs()
Prints a list of the term frequencies.


getCrawlDetails

public String getCrawlDetails()
Returns:
Returns the crawlDetails.

setCrawlDetails

public void setCrawlDetails(String crawlDetails)
Parameters:
crawlDetails - The crawlDetails to set.

getFileLocal

public File getFileLocal()
Returns:
Returns the fileLocal.

setFileLocal

public void setFileLocal(File fileLocal)
Parameters:
fileLocal - The fileLocal to set.

getSearchTerm

public String getSearchTerm()
Returns:
Returns the searchTerm.

setSearchTerm

public void setSearchTerm(String searchTerm)
Parameters:
searchTerm - The searchTerm to set.

getUrlSource

public String getUrlSource()
Returns:
Returns the urlSource.

setUrlSource

public void setUrlSource(String urlSource)
Parameters:
urlSource - The urlSource to set.

getFrequency

public Hashtable<String,Integer> getFrequency()
Returns:
Returns the frequency.

getAudioContent

public Vector<String> getAudioContent()
Returns:
the URLs of the indexed audio content, as a Vector

getImageContent

public Vector<String> getImageContent()
Returns:
the URLs of the indexed image content, as a Vector

getVideoContent

public Vector<String> getVideoContent()
Returns:
the URLs of the indexed video content, as a Vector

setExtAudio

public void setExtAudio(Vector<String> extAudio)
Parameters:
extAudio - a Vector containing possible file extensions for audio files

setExtImage

public void setExtImage(Vector<String> extImage)
Parameters:
extImage - a Vector containing possible file extensions for image files

setExtVideo

public void setExtVideo(Vector<String> extVideo)
Parameters:
extVideo - a Vector containing possible file extensions for video files
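
The extension lists set by setExtAudio, setExtImage and setExtVideo suggest that links are sorted into media types by file extension. A minimal sketch of such a check follows. It assumes that extensions are given without a leading dot (for example "mp3"), that matching is case-insensitive, and that a query string or fragment is ignored; none of this is confirmed by this documentation, and the class MediaExtensionSketch is hypothetical.

```java
import java.util.Locale;
import java.util.Vector;

// Hypothetical helper; not part of comirva.data.
class MediaExtensionSketch {

    // Returns true if the path part of url ends with one of the given extensions.
    // Assumption: extensions carry no leading dot and are matched case-insensitively.
    static boolean hasExtension(String url, Vector<String> extensions) {
        // Drop a query string or fragment so "song.mp3?x=1" still matches "mp3".
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        for (String extension : extensions) {
            if (path.endsWith("." + extension.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could run each extracted link through hasExtension once per extension list and append it to the matching audio, image or video Vector.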

extractLinks

public Vector<String> extractLinks(String htmlLine,
                                   Vector<String> hrefs,
                                   String searchTag,
                                   String searchAttr)
Tries to extract all links from the given attribute in the given tag that occur somewhere in the passed htmlLine.

Parameters:
htmlLine - the HTML code to analyze
hrefs - used only for the method's recursive calls
searchTag - the tag to search (e.g. "a")
searchAttr - the attribute to search within the tag (e.g. "href")
Returns:
a Vector containing the complete URLs to the found links; null if no link was found
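
A self-contained sketch of tag-and-attribute link extraction is given below. It uses a regular expression and java.net.URI, which is not necessarily how this class works internally. The class LinkExtractionSketch and the extra baseUrl parameter are assumptions; baseUrl stands in for whatever source URL the real method uses to build complete URLs. As in the description above, the sketch returns null when no link is found.

```java
import java.net.URI;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of comirva.data.
class LinkExtractionSketch {

    // Collects the values of searchAttr inside searchTag elements and resolves
    // them against baseUrl. Only quoted attribute values are recognized.
    static Vector<String> extractLinks(String htmlLine, String baseUrl,
                                       String searchTag, String searchAttr) {
        // Matches e.g. <a class="x" href="value"> and captures "value".
        Pattern pattern = Pattern.compile(
                "<" + Pattern.quote(searchTag) + "\\b[^>]*?\\s" + Pattern.quote(searchAttr)
                        + "\\s*=\\s*[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);
        Vector<String> links = new Vector<String>();
        Matcher matcher = pattern.matcher(htmlLine);
        while (matcher.find()) {
            try {
                // Relative links become absolute; absolute links are kept as they are.
                links.add(URI.create(baseUrl).resolve(matcher.group(1)).toString());
            } catch (IllegalArgumentException e) {
                // Skip values that are not syntactically valid URIs.
            }
        }
        return links.isEmpty() ? null : links;
    }
}
```

Because null signals "no link found", callers need a null check before iterating over the result.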

extractLinks

public Vector<String> extractLinks(String htmlLine,
                                   String searchTag,
                                   String searchAttribute)
Tries to extract all links from the given attribute in the given tag that occur somewhere in the passed htmlLine.

Parameters:
htmlLine - the HTML code to analyze
searchTag - the tag to search (e.g. "a")
searchAttribute - the attribute to search within the tag (e.g. "href")
Returns:
a Vector containing the complete URLs to the found links; null if no link was found

extractLinks

public Vector<String> extractLinks(String htmlLine,
                                   Vector<String> hrefs,
                                   String searchAttr)
Tries to extract all links from the given attribute (in an arbitrary tag) that occur somewhere in the passed htmlLine.

Parameters:
htmlLine - the HTML code to analyze
hrefs - used only for the method's recursive calls
searchAttr - the attribute to search within the tag (e.g. "href")
Returns:
a Vector containing the complete URLs to the found links; null if no link was found
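
One way an accumulator parameter such as hrefs can serve a recursive call is sketched below: the method takes the first attribute value it finds, stores it, and calls itself on the rest of the line. This is an illustration under assumptions, not the CoMIRVA code. The class RecursiveLinkScanSketch is hypothetical, only double-quoted values written as attr="value" are recognized, and the values are returned as found rather than being completed to full URLs.

```java
import java.util.Vector;

// Hypothetical helper; not part of comirva.data.
class RecursiveLinkScanSketch {

    // Finds the first searchAttr="value" in htmlLine, appends the value to hrefs,
    // then recurses on the remainder of the line. hrefs is null on the first call.
    static Vector<String> extractLinks(String htmlLine, Vector<String> hrefs, String searchAttr) {
        String needle = searchAttr + "=\"";
        int start = htmlLine.indexOf(needle);
        if (start < 0) {
            return hrefs; // no further occurrence: return what was collected (possibly null)
        }
        int valueStart = start + needle.length();
        int valueEnd = htmlLine.indexOf('"', valueStart);
        if (valueEnd < 0) {
            return hrefs; // unterminated attribute value: stop scanning
        }
        if (hrefs == null) {
            hrefs = new Vector<String>();
        }
        hrefs.add(htmlLine.substring(valueStart, valueEnd));
        return extractLinks(htmlLine.substring(valueEnd + 1), hrefs, searchAttr);
    }

    // Entry point for external callers, who do not need to supply an accumulator.
    static Vector<String> extractLinks(String htmlLine, String searchAttr) {
        return extractLinks(htmlLine, null, searchAttr);
    }
}
```

Starting with a null accumulator gives the "null if no link was found" result without a separate emptiness check.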

extractLinks

public Vector<String> extractLinks(String htmlLine,
                                   String searchAttribute)
Tries to extract all links from the given attribute (in an arbitrary tag) that occur somewhere in the passed htmlLine.

Parameters:
htmlLine - the HTML code to analyze
searchAttribute - the attribute to search within the tag (e.g. "href", "src")
Returns:
a Vector containing the complete URLs to the found links; null if no link was found