comirva.util
Class TermProfileUtils

java.lang.Object
  extended by comirva.util.TermProfileUtils

public class TermProfileUtils
extends Object

This class implements simple utilities for term profile creation, access, modification, and conversion. Furthermore, it implements some calculations for term occurrence.


Constructor Summary
TermProfileUtils()
           
 
Method Summary
static Vector<String> extractTermsFromDocuments(File dir, FileFilter filter)
          Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found.
static Vector<String> extractTermsFromDocuments(File dir, FileFilter filter, JLabel statusBar)
          Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found.
static void generateEntityTermProfiles(File rootDir, Vector<String> terms)
          Given a root directory that contains subdirectories (one for each entity) and a list of terms, this method generates an EntityTermProfile for every subdirectory (entity) and serializes the information as XML files using the classes EntityTermProfile and SingleTermList.
static void generateEntityTermProfiles(File rootDir, Vector<String> terms, JLabel statusBar)
          Given a root directory that contains subdirectories (one for each entity) and a list of terms, this method generates an EntityTermProfile for every subdirectory (entity) and serializes the information as XML files using the classes EntityTermProfile and SingleTermList.
static EntityTermProfile getEntityTermProfileFromXML(File xmlFile)
          Reads the EntityTermProfile from an XML-representation.
static String getFileContent(File textFile)
          Fetch the entire content of a text file and return it in a String.
static Vector<String> getMaskedDocumentPaths(Vector<String> docPaths, Vector<Integer> indices)
          Creates and returns a subset of a Vector.
static Hashtable<String,Integer> getNonZeroOccurringTerms(DataMatrix toMatrix, Vector<String> terms)
          Returns all terms for which the document frequency is greater than 0.
static DataMatrix getOccurrenceMatrixFromETP(EntityTermProfile etp)
          Extracts the term occurrence matrix from an EntityTermProfile.
static DataMatrix getOccurrenceMatrixFromETP(File xmlFile)
          Extracts a term occurrence matrix from an XML representation of an EntityTermProfile. The matrix is a DataMatrix whose rows represent the terms and whose columns represent the documents of the entity. Each value is either 0 or 1, according to whether the respective term occurs in the respective document.
static DataMatrix getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix, Vector<String> terms, Vector<String> filterTerms)
          Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms.
static DataMatrix getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix, Vector<String> terms, Vector<String> filterTerms, Vector<Integer> idxDocsContainingAllFilterTerms)
          Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms.
static Hashtable getTermsWithHighestOccurrence(DataMatrix toMatrix, Vector<String> terms, int max)
          Determines the terms that have the highest document frequency (the highest number of documents in which they occur). At most max terms are returned (those with the highest document frequency).
static Hashtable getTermsWithHighestTFxIDF(Vector<Double> tfxidf, Vector<String> terms, int max)
          Determines the terms that have the highest TFxIDF values. At most max terms are returned (those with the highest TFxIDF).
static void setFileContent(File textFile, String content)
          Changes the contents of a text file, overwriting any existing text.
static void updatePathsInETP(File xmlFile)
          Updates the paths of an ETP and its STLs which are stored in an ETP-XML-file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TermProfileUtils

public TermProfileUtils()
Method Detail

getOccurrenceMatrixFromETP

public static DataMatrix getOccurrenceMatrixFromETP(File xmlFile)
Extracts a term occurrence matrix from an XML representation of an EntityTermProfile. The matrix is a DataMatrix whose rows represent the terms and whose columns represent the documents of the entity. Each value is either 0 or 1, according to whether the respective term occurs in the respective document.

Parameters:
xmlFile - the XML-file representing the EntityTermProfile

getOccurrenceMatrixFromETP

public static DataMatrix getOccurrenceMatrixFromETP(EntityTermProfile etp)
Extracts the term occurrence matrix from an EntityTermProfile. The term occurrence matrix is a DataMatrix whose rows represent the terms and whose columns represent the documents of the entity. Each value is either 0 or 1, according to whether the respective term occurs in the respective document.

Parameters:
etp - the EntityTermProfile containing the data
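
To make the matrix layout concrete, here is a minimal sketch of a 0/1 term occurrence matrix with terms as rows and documents as columns. It does not use CoMIRVA's DataMatrix or EntityTermProfile: the class OccurrenceMatrixSketch, the method build, the int[][] representation, and the case-insensitive substring test are all assumptions made for illustration only.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of CoMIRVA: builds a 0/1 term occurrence matrix. */
final class OccurrenceMatrixSketch {

    /**
     * Rows correspond to terms, columns to documents.
     * A cell is 1 if the lower-cased document text contains the lower-cased term
     * as a substring (an assumed matching rule), and 0 otherwise.
     */
    static int[][] build(List<String> terms, List<String> documents) {
        int[][] matrix = new int[terms.size()][documents.size()];
        for (int t = 0; t < terms.size(); t++) {
            String term = terms.get(t).toLowerCase(Locale.ROOT);
            for (int d = 0; d < documents.size(); d++) {
                String text = documents.get(d).toLowerCase(Locale.ROOT);
                matrix[t][d] = text.contains(term) ? 1 : 0;
            }
        }
        return matrix;
    }
}
```

With this layout, matrix[t][d] answers the question "does term t occur in document d?", and the sum of row t is the document frequency of term t.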

getEntityTermProfileFromXML

public static EntityTermProfile getEntityTermProfileFromXML(File xmlFile)
Reads the EntityTermProfile from an XML-representation.

Parameters:
xmlFile - the XML-file representing the EntityTermProfile

generateEntityTermProfiles

public static void generateEntityTermProfiles(File rootDir,
                                              Vector<String> terms,
                                              JLabel statusBar)
Given a root directory that contains subdirectories (one for each entity) and a list of terms, this method generates an EntityTermProfile for every subdirectory (entity) and serializes the information as XML files using the classes EntityTermProfile and SingleTermList.

Parameters:
rootDir - a File that points to the root directory
terms - a Vector containing the term list
statusBar - a JLabel representing the status bar (for updating CoMIRVA's UI)

generateEntityTermProfiles

public static void generateEntityTermProfiles(File rootDir,
                                              Vector<String> terms)
Given a root directory that contains subdirectories (one for each entity) and a list of terms, this method generates an EntityTermProfile for every subdirectory (entity) and serializes the information as XML files using the classes EntityTermProfile and SingleTermList.

Parameters:
rootDir - a File that points to the root directory
terms - a Vector containing the term list

getTermsWithHighestOccurrence

public static Hashtable getTermsWithHighestOccurrence(DataMatrix toMatrix,
                                                      Vector<String> terms,
                                                      int max)
Determines the terms that have the highest document frequency (the highest number of documents in which they occur). At most max terms are returned (those with the highest document frequency).

Parameters:
toMatrix - the term occurrence matrix
terms - a Vector containing the terms
max - the maximum number of returned terms
Returns:
a Hashtable with the most often occurring terms as keys and their occurrence as values
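
The selection by document frequency can be sketched as follows. This is an illustration under stated assumptions, not the CoMIRVA implementation: the matrix is a plain int[][] (rows are terms, columns are documents), and the class DocumentFrequencySketch with its methods is hypothetical. The sketch returns a LinkedHashMap so that the ranking order stays visible, whereas the documented method returns a Hashtable, which has no defined order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of CoMIRVA. */
final class DocumentFrequencySketch {

    /** The document frequency of a term is the sum of its row in the 0/1 matrix. */
    static int[] documentFrequencies(int[][] tom) {
        int[] df = new int[tom.length];
        for (int t = 0; t < tom.length; t++) {
            for (int cell : tom[t]) {
                df[t] += cell;
            }
        }
        return df;
    }

    /** Returns at most max terms with the highest document frequency, highest first. */
    static Map<String, Integer> topTerms(int[][] tom, List<String> terms, int max) {
        int[] df = documentFrequencies(tom);
        List<Integer> order = new ArrayList<>();
        for (int t = 0; t < terms.size(); t++) {
            order.add(t);
        }
        // Sort term indices by descending document frequency; ties keep their input order.
        order.sort(Comparator.comparingInt((Integer t) -> df[t]).reversed());
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(max, order.size()); i++) {
            int t = order.get(i);
            result.put(terms.get(t), df[t]);
        }
        return result;
    }
}
```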

getTermsWithHighestTFxIDF

public static Hashtable getTermsWithHighestTFxIDF(Vector<Double> tfxidf,
                                                  Vector<String> terms,
                                                  int max)
Determines the terms that have the highest TFxIDF values. At most max terms are returned (those with the highest TFxIDF).

Parameters:
tfxidf - the TFxIDF values of all terms
terms - a Vector containing the terms
max - the maximum number of returned terms
Returns:
a Hashtable with the terms that have the highest TFxIDF values as keys and their TFxIDF values as values

getNonZeroOccurringTerms

public static Hashtable<String,Integer> getNonZeroOccurringTerms(DataMatrix toMatrix,
                                                                 Vector<String> terms)
Returns all terms for which the document frequency is greater than 0.

Parameters:
toMatrix - the term occurrence matrix
terms - a Vector containing the terms
Returns:
a Hashtable with the terms whose document frequency is greater than 0 as keys and their document frequencies as values
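
As a rough illustration of the filtering step, the sketch below keeps every term whose row in a 0/1 matrix contains at least one 1. The int[][] representation and the names NonZeroTermsSketch and nonZeroTerms are assumptions for this example and are not taken from CoMIRVA.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of CoMIRVA. */
final class NonZeroTermsSketch {

    /** Maps each term that occurs in at least one document to its document frequency. */
    static Map<String, Integer> nonZeroTerms(int[][] tom, List<String> terms) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int t = 0; t < tom.length; t++) {
            int df = 0;
            for (int cell : tom[t]) {
                df += cell;
            }
            if (df > 0) {
                result.put(terms.get(t), df);
            }
        }
        return result;
    }
}
```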

getSubsetOfTermOccurrenceMatrix

public static DataMatrix getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix,
                                                         Vector<String> terms,
                                                         Vector<String> filterTerms,
                                                         Vector<Integer> idxDocsContainingAllFilterTerms)
Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms.

Parameters:
toMatrix - the term occurrence matrix
terms - a Vector representing the terms for which the TOM (DataMatrix) was created
filterTerms - a Vector containing the terms that must be included (all documents that do not include all of these terms will be filtered out)
idxDocsContainingAllFilterTerms - an empty Vector that will contain the indices of the documents that contain the filter terms after execution of this method
Returns:
a DataMatrix that contains only the term occurrences for the documents that contain all filter terms
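
The filtering described above, including the output parameter that collects the indices of the retained documents, might look roughly like this. The sketch works on a plain int[][] instead of a DataMatrix; the class TomSubsetSketch, the method subset, and the rule that an unknown filter term matches no document are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of CoMIRVA. */
final class TomSubsetSketch {

    /**
     * Keeps only the columns (documents) in which every filter term occurs.
     * keptDocs is expected to be empty on entry and holds the retained column indices on return.
     */
    static int[][] subset(int[][] tom, List<String> terms,
                          List<String> filterTerms, List<Integer> keptDocs) {
        int docCount = tom.length == 0 ? 0 : tom[0].length;
        // Resolve each filter term to its row; an unknown term can never be satisfied.
        List<Integer> filterRows = new ArrayList<>();
        boolean unknownTerm = false;
        for (String filterTerm : filterTerms) {
            int row = terms.indexOf(filterTerm);
            if (row < 0) {
                unknownTerm = true;
            } else {
                filterRows.add(row);
            }
        }
        for (int d = 0; d < docCount && !unknownTerm; d++) {
            boolean containsAll = true;
            for (int row : filterRows) {
                if (tom[row][d] == 0) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                keptDocs.add(d);
            }
        }
        // Copy the retained columns for all terms, not only the filter terms.
        int[][] result = new int[tom.length][keptDocs.size()];
        for (int t = 0; t < tom.length; t++) {
            for (int k = 0; k < keptDocs.size(); k++) {
                result[t][k] = tom[t][keptDocs.get(k)];
            }
        }
        return result;
    }
}
```

The collected indices refer to columns of the original matrix, so they can be used afterwards to look up the matching document paths.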

getSubsetOfTermOccurrenceMatrix

public static DataMatrix getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix,
                                                         Vector<String> terms,
                                                         Vector<String> filterTerms)
Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms.

Parameters:
toMatrix - the term occurrence matrix
terms - a Vector representing the terms for which the TOM (DataMatrix) was created
filterTerms - a Vector containing the terms that must be included (all documents that do not include all of these terms will be filtered out)
Returns:
a DataMatrix that contains only the term occurrences for the documents that contain all filter terms

getMaskedDocumentPaths

public static Vector<String> getMaskedDocumentPaths(Vector<String> docPaths,
                                                    Vector<Integer> indices)
Creates and returns a subset of a Vector. The elements contained in the subset are defined by a Vector of the indices.

Parameters:
docPaths - a Vector containing all document paths
indices - a Vector containing the indices of the document paths that should be returned
Returns:

extractTermsFromDocuments

public static Vector<String> extractTermsFromDocuments(File dir,
                                                       FileFilter filter,
                                                       JLabel statusBar)
Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found. Stopwords are automatically removed.

Parameters:
dir - a File that represents the directory where the HTML-files reside
filter - a FileFilter that filters the documents
statusBar - a JLabel representing the status bar (for updating CoMIRVA's UI)
Returns:
a Vector containing the extracted terms
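
The overall flow (list the files accepted by a FileFilter, tokenize their text, drop stopwords, collect distinct terms) can be sketched with the standard library alone. Everything specific here is an assumption: the class TermExtractionSketch, the tiny stopword list, the tokenization on non-letter and non-digit characters, lower-casing, and UTF-8 decoding. The sketch also does not strip HTML markup, which a real extractor for HTML files would need to do.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of CoMIRVA. */
final class TermExtractionSketch {

    // Tiny illustrative stopword list; the list CoMIRVA uses is not shown on this page.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in", "is"));

    /** Collects the distinct, lower-cased, non-stopword terms of all matching files. */
    static List<String> extractTerms(File dir, FileFilter filter) throws IOException {
        Set<String> terms = new LinkedHashSet<>();
        File[] files = dir.listFiles(filter);
        if (files == null) {
            return new ArrayList<>(); // dir is not a directory or cannot be read
        }
        Arrays.sort(files); // deterministic processing order
        for (File file : files) {
            if (!file.isFile()) {
                continue;
            }
            String text = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
            // Split on every run of characters that are neither letters nor digits.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                    terms.add(token);
                }
            }
        }
        return new ArrayList<>(terms);
    }
}
```

A filter such as `f -> f.getName().endsWith(".html")` can be passed as the second argument, since FileFilter has a single abstract method.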

extractTermsFromDocuments

public static Vector<String> extractTermsFromDocuments(File dir,
                                                       FileFilter filter)
Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found. Stopwords are automatically removed.

Parameters:
dir - a File that represents the directory where the HTML-files reside
filter - a FileFilter that filters the documents
Returns:
a Vector containing the extracted terms

updatePathsInETP

public static void updatePathsInETP(File xmlFile)
Updates the paths of an ETP and its STLs which are stored in an ETP-XML-file. For technical reasons, absolute paths must be stored in the XML-serialized ETP-files. Thus, moving the files to a new location requires updating the path information. That is what this function does. For this method to work it is vital that the directory of the ETP-XML-file contains a subdirectory with the corresponding STL-XML-files (and documents).

Parameters:
xmlFile - the XML-file representing the EntityTermProfile

getFileContent

public static String getFileContent(File textFile)
Fetches the entire content of a text file and returns it as a String.

Parameters:
textFile - is a file which already exists and can be read

setFileContent

public static void setFileContent(File textFile,
                                  String content)
Changes the contents of a text file, overwriting any existing text.

Parameters:
textFile - is an existing file which can be written to
content - a String representing the content that is to be written to the file
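
For comparison, reading and overwriting a whole text file can be done with java.nio.file.Files. This sketch is not the CoMIRVA implementation: the class FileContentSketch and its method names are hypothetical, and UTF-8 is an assumed encoding, since the page does not say which charset the documented methods use.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper, not part of CoMIRVA. */
final class FileContentSketch {

    /** Reads the whole file into a String, decoding it as UTF-8 (assumed encoding). */
    static String read(File textFile) throws IOException {
        return new String(Files.readAllBytes(textFile.toPath()), StandardCharsets.UTF_8);
    }

    /** Replaces the file's content; Files.write truncates an existing file by default. */
    static void write(File textFile, String content) throws IOException {
        Files.write(textFile.toPath(), content.getBytes(StandardCharsets.UTF_8));
    }
}
```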