comirva.data
Class EntityTermProfile

java.lang.Object
  extended by comirva.data.EntityTermProfile
All Implemented Interfaces:
XMLSerializable, Serializable

public class EntityTermProfile
extends Object
implements XMLSerializable, Serializable

This class implements a term profile for entities such as artist names. It is intended for text mining, and its design is geared mainly toward processing HTML files.

See Also:
Serialized Form

Field Summary
(package private)  String crawlDetails
           
(package private)  File dirLocal
           
(package private)  Hashtable<String,Integer> documentFrequency
           
(package private)  String entityName
           
(package private)  Vector<String> extAudio
           
(package private)  Vector<String> extImage
           
(package private)  Vector<String> extVideo
           
(package private)  Hashtable<String,Double> IDF
           
(package private)  Integer numberDocuments
           
(package private)  Vector<SingleTermList> singleTermLists
           
(package private)  Hashtable<String,Long> termFrequency
           
(package private)  Vector<Vector<Integer>> termOccurrenceOnDocuments
           
(package private)  Vector<String> terms
           
(package private)  int[][] tfDocs
           
(package private)  Hashtable<String,Double> TFxIDF
           
 
Constructor Summary
EntityTermProfile()
          Creates a new EntityTermProfile instance.
EntityTermProfile(File dirLocal)
          Creates a new EntityTermProfile instance for the documents stored in the given directory.
 
Method Summary
 void calculateOccurrences(Vector<String> termList, FileFilter documentFileFilter)
          Counts the occurrences of each term in the given Vector across all text documents of the entity, and stores the term frequencies together with information on which documents each term occurs in.
 String getCrawlDetails()
           
 File getDirLocal()
           
 Hashtable<String,Integer> getDocumentFrequency()
           
 String getEntityName()
           
 TermsWeights getMostImportantTerms(int maxNoTerms, Hashtable termWeightings)
           
 Integer getNumberDocuments()
           
 Vector<SingleTermList> getSingleTermLists()
           
 Hashtable<String,Long> getTermFrequency()
           
 Vector<Vector<Integer>> getTermOccurrenceOnDocuments()
           
 Vector<String> getTerms()
           
 Hashtable<String,Double> getTFxIDF()
           
 void readXML(XMLStreamReader reader)
          Deserializes an EntityTermProfile instance from an XML file.
 void setCrawlDetails(String crawlDetails)
           
 void setDirLocal(File dirLocal)
           
 void setEntityName(String entityName)
           
 void setExtAudio(Vector<String> extAudio)
           
 void setExtImage(Vector<String> extImage)
           
 void setExtVideo(Vector<String> extVideo)
           
 void writeXML(XMLStreamWriter writer)
          Serializes an EntityTermProfile instance as an XML file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

singleTermLists

Vector<SingleTermList> singleTermLists

terms

Vector<String> terms

termFrequency

Hashtable<String,Long> termFrequency

termOccurrenceOnDocuments

Vector<Vector<Integer>> termOccurrenceOnDocuments

documentFrequency

Hashtable<String,Integer> documentFrequency

TFxIDF

Hashtable<String,Double> TFxIDF

IDF

Hashtable<String,Double> IDF

tfDocs

int[][] tfDocs

dirLocal

File dirLocal

entityName

String entityName

numberDocuments

Integer numberDocuments

crawlDetails

String crawlDetails

extAudio

Vector<String> extAudio

extImage

Vector<String> extImage

extVideo

Vector<String> extVideo
Constructor Detail

EntityTermProfile

public EntityTermProfile(File dirLocal)
Creates a new EntityTermProfile instance for the documents stored in the given directory.

Parameters:
dirLocal - the directory where all documents belonging to the entity are stored

EntityTermProfile

public EntityTermProfile()
Creates a new EntityTermProfile instance.

Method Detail

calculateOccurrences

public void calculateOccurrences(Vector<String> termList,
                                 FileFilter documentFileFilter)
Counts the occurrences of each term in the given Vector across all text documents of the entity, and stores the term frequencies together with information on which documents each term occurs in.

Parameters:
termList - a Vector containing a term list
documentFileFilter - a FileFilter for the documents that should be searched for the terms in the term list
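The bookkeeping described above can be sketched roughly as follows. This is not the CoMIRVA implementation: the class `OccurrenceSketch`, its method names, and the choice of case-insensitive substring matching are assumptions made for illustration, and the sketch works on in-memory strings rather than on files selected by a FileFilter. Only the three result structures mirror the field types listed on this page.

```java
import java.util.Hashtable;
import java.util.List;
import java.util.Locale;
import java.util.Vector;

// Hypothetical helper, not part of CoMIRVA: gathers term statistics over in-memory documents.
final class OccurrenceSketch {
    // Total number of occurrences of each term over all documents.
    final Hashtable<String, Long> termFrequency = new Hashtable<>();
    // Number of documents in which each term occurs at least once.
    final Hashtable<String, Integer> documentFrequency = new Hashtable<>();
    // For each term (in term-list order), the indices of the documents containing it.
    final Vector<Vector<Integer>> termOccurrenceOnDocuments = new Vector<>();

    // Counts non-overlapping, case-insensitive substring matches (an assumed matching rule).
    static int countOccurrences(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Fills the three structures for the given terms and document texts.
    void calculate(List<String> termList, List<String> documents) {
        for (String term : termList) {
            long tf = 0;
            Vector<Integer> docs = new Vector<>();
            for (int d = 0; d < documents.size(); d++) {
                int n = countOccurrences(documents.get(d), term);
                if (n > 0) {
                    tf += n;
                    docs.add(d);
                }
            }
            termFrequency.put(term, tf);
            documentFrequency.put(term, docs.size());
            termOccurrenceOnDocuments.add(docs);
        }
    }
}
```

In a file-based setting, the document texts would first be read from the files in dirLocal that the FileFilter accepts; that step is left out here.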

writeXML

public void writeXML(XMLStreamWriter writer)
Serializes an EntityTermProfile instance as an XML file.

Specified by:
writeXML in interface XMLSerializable
Parameters:
writer - an XMLStreamWriter that points to the XML file
See Also:
XMLSerializable.writeXML(javax.xml.stream.XMLStreamWriter)
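As a rough illustration of writing term data through a javax.xml.stream.XMLStreamWriter, the sketch below emits a small XML fragment. The element and attribute names (`EntityTermProfile`, `term`, `entityName`, `tf`) and the class `ProfileXmlWriterSketch` are hypothetical; the actual XML layout produced by writeXML is not documented on this page.

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of CoMIRVA: writes term frequencies with the StAX API.
final class ProfileXmlWriterSketch {
    // Writes one <term> element per entry; the element names are assumptions.
    static void writeProfile(XMLStreamWriter writer, String entityName,
                             Map<String, Long> termFrequency) throws XMLStreamException {
        writer.writeStartElement("EntityTermProfile");
        writer.writeAttribute("entityName", entityName);
        // Sort by term so the output order is deterministic.
        for (Map.Entry<String, Long> e : new TreeMap<String, Long>(termFrequency).entrySet()) {
            writer.writeStartElement("term");
            writer.writeAttribute("tf", String.valueOf(e.getValue()));
            writer.writeCharacters(e.getKey()); // special characters are escaped by the writer
            writer.writeEndElement();
        }
        writer.writeEndElement();
    }

    // Convenience wrapper that serializes into a String instead of a file.
    static String toXml(String entityName, Map<String, Long> termFrequency) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writeProfile(writer, entityName, termFrequency);
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("XML serialization failed", ex);
        }
        return out.toString();
    }
}
```

To target a file instead, the StringWriter could be replaced by a FileWriter or FileOutputStream passed to the same factory method.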

readXML

public void readXML(XMLStreamReader reader)
Deserializes an EntityTermProfile instance from an XML file.

Specified by:
readXML in interface XMLSerializable
Parameters:
reader - an XMLStreamReader that points to the XML file
See Also:
XMLSerializable.readXML(javax.xml.stream.XMLStreamReader)
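Reading with a javax.xml.stream.XMLStreamReader typically means pulling events in a loop and reacting to the elements of interest. The sketch below does this for a hypothetical layout in which each term is stored as `<term tf="...">text</term>`; the layout and the class `ProfileXmlReaderSketch` are assumptions, since the format expected by readXML is not described on this page.

```java
import java.io.StringReader;
import java.util.Hashtable;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of CoMIRVA: reads term frequencies with the StAX API.
final class ProfileXmlReaderSketch {
    // Collects every <term tf="..."> element into a term-to-frequency table.
    static Hashtable<String, Long> readTermFrequencies(XMLStreamReader reader)
            throws XMLStreamException {
        Hashtable<String, Long> termFrequency = new Hashtable<>();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "term".equals(reader.getLocalName())) {
                long tf = Long.parseLong(reader.getAttributeValue(null, "tf"));
                // getElementText() consumes the text up to the matching end tag.
                String term = reader.getElementText();
                termFrequency.put(term, tf);
            }
        }
        return termFrequency;
    }

    // Convenience wrapper that parses from a String instead of a file.
    static Hashtable<String, Long> parse(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                return readTermFrequencies(reader);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("XML deserialization failed", ex);
        }
    }
}
```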

getMostImportantTerms

public TermsWeights getMostImportantTerms(int maxNoTerms,
                                          Hashtable termWeightings)
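This method carries no description here. Judging only by its name and parameters, it presumably selects up to maxNoTerms terms with the highest weights. A minimal sketch of such a selection is shown below; the class `TopTermsSketch`, the typed `Map<String, Double>` parameter, the plain list return type (instead of TermsWeights), and the alphabetical tie-breaking are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CoMIRVA: picks the highest-weighted terms.
final class TopTermsSketch {
    // Returns at most maxNoTerms terms, ordered by descending weight, ties broken by term.
    static List<String> mostImportantTerms(int maxNoTerms, Map<String, Double> termWeightings) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(termWeightings.entrySet());
        entries.sort((a, b) -> {
            int byWeight = Double.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(maxNoTerms, entries.size()));
        List<String> result = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```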

getCrawlDetails

public String getCrawlDetails()
Returns:
the crawlDetails

setCrawlDetails

public void setCrawlDetails(String crawlDetails)
Parameters:
crawlDetails - The crawlDetails to set.

getDirLocal

public File getDirLocal()
Returns:
the directory where all documents belonging to the entity are stored

setDirLocal

public void setDirLocal(File dirLocal)
Parameters:
dirLocal - The dirLocal to set.

getEntityName

public String getEntityName()
Returns:
the name of the entity

setEntityName

public void setEntityName(String entityName)
Parameters:
entityName - The entityName to set.

getTermFrequency

public Hashtable<String,Long> getTermFrequency()
Returns:
the term frequencies, keyed by term

getDocumentFrequency

public Hashtable<String,Integer> getDocumentFrequency()
Returns:
the document frequencies, keyed by term

getTermOccurrenceOnDocuments

public Vector<Vector<Integer>> getTermOccurrenceOnDocuments()
Returns:
the termOccurrenceOnDocuments

getTerms

public Vector<String> getTerms()
Returns:
the terms

getNumberDocuments

public Integer getNumberDocuments()
Returns:
the number of documents

getTFxIDF

public Hashtable<String,Double> getTFxIDF()
Returns:
the TFxIDF values, keyed by term
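This page does not state which TFxIDF variant the class computes. As a point of reference, one common formulation multiplies the term frequency by the natural logarithm of numberDocuments divided by the document frequency. The sketch below implements that variant only; the class `TfIdfSketch` and the handling of missing or zero document frequencies are assumptions.

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical helper, not part of CoMIRVA: one common TFxIDF variant, tf * ln(N / df).
final class TfIdfSketch {
    static Hashtable<String, Double> tfIdf(Map<String, Long> termFrequency,
                                           Map<String, Integer> documentFrequency,
                                           int numberDocuments) {
        Hashtable<String, Double> weights = new Hashtable<>();
        for (Map.Entry<String, Long> e : termFrequency.entrySet()) {
            Integer df = documentFrequency.get(e.getKey());
            if (df == null || df == 0) {
                // Assumed convention: a term that occurs in no document gets weight 0.
                weights.put(e.getKey(), 0.0);
                continue;
            }
            double idf = Math.log((double) numberDocuments / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}
```

Under this variant, a term that occurs in every document receives a weight of zero, because ln(1) = 0.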

getSingleTermLists

public Vector<SingleTermList> getSingleTermLists()
Returns:
the singleTermLists

setExtAudio

public void setExtAudio(Vector<String> extAudio)
Parameters:
extAudio - a Vector containing possible file extensions for audio files

setExtImage

public void setExtImage(Vector<String> extImage)
Parameters:
extImage - a Vector containing possible file extensions for image files

setExtVideo

public void setExtVideo(Vector<String> extVideo)
Parameters:
extVideo - a Vector containing possible file extensions for video files