java.lang.Object comirva.util.TermProfileUtils
public class TermProfileUtils
This class implements simple utilities for term profile creation, access, modification, and conversion. Furthermore, it implements some calculations for term occurrence.
Constructor Summary | |
---|---|
TermProfileUtils()
|
Method Summary | |
---|---|
static Vector<String> |
extractTermsFromDocuments(File dir,
FileFilter filter)
Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found. |
static Vector<String> |
extractTermsFromDocuments(File dir,
FileFilter filter,
JLabel statusBar)
Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found. |
static void |
generateEntityTermProfiles(File rootDir,
Vector<String> terms)
Given a root directory that contains subdirectories (one for each entity) and a list of terms, this method generates EntityTermProfiles for every subdirectory (entity) and serializes the information as XML files using the classes EntityTermProfile and SingleTermList. |
static void |
generateEntityTermProfiles(File rootDir,
Vector<String> terms,
JLabel statusBar)
Given a root directory that contains subdirectories (one for each entity) and a list of terms, this method generates EntityTermProfiles for every subdirectory (entity) and serializes the information as XML files using the classes EntityTermProfile and SingleTermList. |
static EntityTermProfile |
getEntityTermProfileFromXML(File xmlFile)
Reads the EntityTermProfile from an XML-representation. |
static String |
getFileContent(File textFile)
Fetch the entire content of a text file and return it in a String. |
static Vector<String> |
getMaskedDocumentPaths(Vector<String> docPaths,
Vector<Integer> indices)
Creates and returns the subset of the document paths in docPaths that is selected by indices. |
static Hashtable<String,Integer> |
getNonZeroOccurringTerms(DataMatrix toMatrix,
Vector<String> terms)
Returns all terms for which the document frequency is greater than 0. |
static DataMatrix |
getOccurrenceMatrixFromETP(EntityTermProfile etp)
Extracts the term occurrence matrix from an EntityTermProfile. |
static DataMatrix |
getOccurrenceMatrixFromETP(File xmlFile)
Extracts a term occurrence matrix from an XML representation of an EntityTermProfile. The result is a DataMatrix whose rows represent the terms and whose columns represent the documents of the entity; each value is either 0 or 1, according to whether the respective term occurs in the respective document. |
static DataMatrix |
getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix,
Vector<String> terms,
Vector<String> filterTerms)
Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms. |
static DataMatrix |
getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix,
Vector<String> terms,
Vector<String> filterTerms,
Vector<Integer> idxDocsContainingAllFilterTerms)
Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms. |
static Hashtable |
getTermsWithHighestOccurrence(DataMatrix toMatrix,
Vector<String> terms,
int max)
Determines the terms that have the highest document frequency (the highest number of documents in which they occur). A maximum of max terms are returned (those with the highest document frequency). |
static Hashtable |
getTermsWithHighestTFxIDF(Vector<Double> tfxidf,
Vector<String> terms,
int max)
Determines the terms that have the highest TFxIDF values. A maximum of max terms are returned (those with the highest TFxIDF). |
static void |
setFileContent(File textFile,
String content)
Changes the contents of a text file, overwriting any existing text. |
static void |
updatePathsInETP(File xmlFile)
Updates the paths of an ETP and its STLs which are stored in an ETP-XML-file. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TermProfileUtils()
Method Detail |
---|
public static DataMatrix getOccurrenceMatrixFromETP(File xmlFile)
xmlFile - the XML file representing the EntityTermProfile

public static DataMatrix getOccurrenceMatrixFromETP(EntityTermProfile etp)
etp - the EntityTermProfile containing the data

public static EntityTermProfile getEntityTermProfileFromXML(File xmlFile)
xmlFile - the XML file representing the EntityTermProfile

public static void generateEntityTermProfiles(File rootDir, Vector<String> terms, JLabel statusBar)
rootDir - a File that points to the root directory
terms - a Vector containing the term list
statusBar - a JLabel representing the status bar (for updating CoMIRVA's UI)

public static void generateEntityTermProfiles(File rootDir, Vector<String> terms)
rootDir - a File that points to the root directory
terms - a Vector containing the term list

public static Hashtable getTermsWithHighestOccurrence(DataMatrix toMatrix, Vector<String> terms, int max)
Determines the terms that have the highest document frequency (the highest number of documents in which they occur). A maximum of max terms are returned (those with the highest document frequency).
toMatrix - the term occurrence matrix
terms - a Vector containing the terms
max - the maximum number of returned terms
public static Hashtable getTermsWithHighestTFxIDF(Vector<Double> tfxidf, Vector<String> terms, int max)
Determines the terms that have the highest TFxIDF values. A maximum of max terms are returned (those with the highest TFxIDF).
tfxidf - the TFxIDF values of all terms
terms - a Vector containing the terms
max - the maximum number of returned terms
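
The selection described for getTermsWithHighestTFxIDF can be sketched as a sort over term indices. This is only an illustration, not CoMIRVA's implementation: the class TopTermsSketch and the method topTerms are hypothetical names, and the sketch returns a LinkedHashMap (so that the ranking order stays visible) rather than the raw Hashtable declared by the real method, whose exact contents this page does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Vector;

// Hypothetical helper, not part of comirva.util.
class TopTermsSketch {

    /** Returns at most max terms with the highest scores, ordered from highest to lowest. */
    static Map<String, Double> topTerms(Vector<Double> scores, Vector<String> terms, int max) {
        if (scores.size() != terms.size()) {
            throw new IllegalArgumentException("scores and terms must have the same length");
        }
        // Sort the term indices by descending score. List.sort is stable,
        // so terms with equal scores keep their original relative order.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> scores.get(i)).reversed());

        // A LinkedHashMap keeps insertion order, i.e. the ranking.
        Map<String, Double> result = new LinkedHashMap<>();
        for (int k = 0; k < Math.min(max, order.size()); k++) {
            int idx = order.get(k);
            result.put(terms.get(idx), scores.get(idx));
        }
        return result;
    }

    public static void main(String[] args) {
        Vector<Double> scores = new Vector<>(Arrays.asList(0.2, 1.5, 0.9, 1.5));
        Vector<String> terms = new Vector<>(Arrays.asList("rock", "jazz", "blues", "metal"));
        System.out.println(topTerms(scores, terms, 3)); // prints {jazz=1.5, metal=1.5, blues=0.9}
    }
}
```

The same ranking would apply to getTermsWithHighestOccurrence if the scores were document frequencies instead of TFxIDF values.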
public static Hashtable<String,Integer> getNonZeroOccurringTerms(DataMatrix toMatrix, Vector<String> terms)
toMatrix - the term occurrence matrix
terms - a Vector containing the terms
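
As a rough illustration of what "document frequency greater than 0" means for a term occurrence matrix, the sketch below counts, for each term row, the documents in which the term occurs. It rests on two assumptions that this page does not confirm: a plain int[][] stands in for CoMIRVA's DataMatrix, and the Integer stored for each term is its document frequency. NonZeroTermsSketch and nonZeroTerms are hypothetical names.

```java
import java.util.Hashtable;
import java.util.Vector;

// Hypothetical helper, not part of comirva.util.
class NonZeroTermsSketch {

    /**
     * occurrence[t][d] is 1 if term t occurs in document d and 0 otherwise
     * (rows are terms, columns are documents).
     */
    static Hashtable<String, Integer> nonZeroTerms(int[][] occurrence, Vector<String> terms) {
        Hashtable<String, Integer> result = new Hashtable<>();
        for (int t = 0; t < occurrence.length; t++) {
            // Document frequency: the number of documents that contain term t.
            int documentFrequency = 0;
            for (int value : occurrence[t]) {
                if (value != 0) {
                    documentFrequency++;
                }
            }
            // Assumption: the map value is the document frequency of the term.
            if (documentFrequency > 0) {
                result.put(terms.get(t), documentFrequency);
            }
        }
        return result;
    }
}
```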
public static DataMatrix getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix, Vector<String> terms, Vector<String> filterTerms, Vector<Integer> idxDocsContainingAllFilterTerms)
Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms.
toMatrix - the term occurrence matrix
terms - a Vector
filterTerms - a Vector
idxDocsContainingAllFilterTerms - an empty Vector

public static DataMatrix getSubsetOfTermOccurrenceMatrix(DataMatrix toMatrix, Vector<String> terms, Vector<String> filterTerms)
Given a term occurrence matrix (TOM) and a list of terms, this method extracts a subset of the TOM containing only those documents that contain all terms given by the parameter filterTerms.
toMatrix - the term occurrence matrix
terms - a Vector representing the terms for which the TOM (DataMatrix) was created
filterTerms - a Vector containing the terms that must be included (all documents that do not include all of these terms will be filtered out)
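
The filtering described above can be sketched as a column selection on a 0/1 matrix. The code is a minimal sketch under stated assumptions, not the library's code: an int[][] replaces DataMatrix, the initially empty index Vector is assumed to receive the indices of the documents that were kept, and the mask method shows one plausible reading of getMaskedDocumentPaths. TomSubsetSketch, subset, and mask are hypothetical names.

```java
import java.util.Arrays;
import java.util.Vector;

// Hypothetical helper, not part of comirva.util.
class TomSubsetSketch {

    /**
     * Keeps only the columns (documents) in which every filter term occurs.
     * The indices of the kept documents are appended to keptDocs, which is
     * expected to be empty on entry.
     */
    static int[][] subset(int[][] occurrence, Vector<String> terms,
                          Vector<String> filterTerms, Vector<Integer> keptDocs) {
        int numDocs = occurrence.length == 0 ? 0 : occurrence[0].length;
        for (int d = 0; d < numDocs; d++) {
            boolean containsAll = true;
            for (String filterTerm : filterTerms) {
                int row = terms.indexOf(filterTerm);
                // A filter term missing from the term list cannot occur in any document.
                if (row < 0 || occurrence[row][d] == 0) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                keptDocs.add(d);
            }
        }
        // Copy the kept columns into a new, narrower matrix.
        int[][] result = new int[occurrence.length][keptDocs.size()];
        for (int t = 0; t < occurrence.length; t++) {
            for (int k = 0; k < keptDocs.size(); k++) {
                result[t][k] = occurrence[t][keptDocs.get(k)];
            }
        }
        return result;
    }

    /** Selects the entries of docPaths at the given indices, in index order. */
    static Vector<String> mask(Vector<String> docPaths, Vector<Integer> indices) {
        Vector<String> selected = new Vector<>();
        for (int index : indices) {
            selected.add(docPaths.get(index));
        }
        return selected;
    }

    public static void main(String[] args) {
        int[][] tom = { {1, 1, 0, 1}, {1, 0, 1, 1}, {0, 1, 0, 1} };
        Vector<String> terms = new Vector<>(Arrays.asList("rock", "guitar", "piano"));
        Vector<String> filter = new Vector<>(Arrays.asList("rock", "guitar"));
        Vector<Integer> kept = new Vector<>();
        int[][] sub = subset(tom, terms, filter, kept);
        System.out.println(kept);                      // prints [0, 3]
        System.out.println(Arrays.deepToString(sub));  // prints [[1, 1], [1, 1], [0, 1]]
    }
}
```

Passing the collected indices to mask together with the list of document paths yields the paths of exactly those documents that survived the filter.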
public static Vector<String> getMaskedDocumentPaths(Vector<String> docPaths, Vector<Integer> indices)
docPaths - a Vector
indices - a Vector

public static Vector<String> extractTermsFromDocuments(File dir, FileFilter filter, JLabel statusBar)
Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found. Stopwords are automatically removed.
dir - a File that represents the directory where the HTML files reside
filter - a FileFilter that filters the documents
statusBar - a JLabel representing the status bar (for updating CoMIRVA's UI)
public static Vector<String> extractTermsFromDocuments(File dir, FileFilter filter)
Searches the directory denoted by dir for all files that match the file filter filter and extracts a list of all occurring terms from the files found. Stopwords are automatically removed.
dir - a File that represents the directory where the HTML files reside
filter - a FileFilter that filters the documents
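
The general shape of such a term extraction (list the matching files, tokenize their text, drop stopwords) might look like the sketch below. It is not CoMIRVA's implementation and makes several simplifying assumptions: files are read as UTF-8 plain text, HTML markup is not stripped, tokens are lower-cased and split on non-alphanumeric characters, and the stopword list is a tiny placeholder. TermExtractionSketch and extractTerms are hypothetical names.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.Vector;

// Hypothetical helper, not part of comirva.util.
class TermExtractionSketch {

    // Placeholder stopword list; a real list would be far longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    /** Returns the distinct, alphabetically sorted terms of all files in dir accepted by filter. */
    static Vector<String> extractTerms(File dir, FileFilter filter) {
        Set<String> terms = new TreeSet<>();        // sorted, no duplicates
        File[] files = dir.listFiles(filter);       // null if dir is not a readable directory
        if (files == null) {
            return new Vector<>();
        }
        for (File file : files) {
            if (!file.isFile()) {
                continue;                           // skip subdirectories
            }
            String text;
            try {
                text = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Naive tokenization: lower-case, then split on anything that is not a letter or digit.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                    terms.add(token);
                }
            }
        }
        return new Vector<>(terms);
    }
}
```

Because FileFilter is a functional interface, a caller can pass a lambda such as `f -> f.getName().endsWith(".html")` as the filter.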
public static void updatePathsInETP(File xmlFile)
xmlFile - the XML file representing the EntityTermProfile

public static String getFileContent(File textFile)
textFile - a file that already exists and can be read

public static void setFileContent(File textFile, String content)
textFile - an existing file that can be written to
content - a String representing the content that is to be written to the file
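
For readers who want the behaviour of getFileContent and setFileContent without the library, the following sketch reads and overwrites a whole text file with java.nio.file.Files. It is an approximation only: the UTF-8 encoding and the unchecked exception handling are assumptions, since this page does not say which encoding or error handling CoMIRVA uses, and FileContentSketch is a hypothetical class name.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, not part of comirva.util.
class FileContentSketch {

    /** Reads the entire file into a String (assumes UTF-8). */
    static String getFileContent(File textFile) {
        try {
            return new String(Files.readAllBytes(textFile.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replaces the file's content with the given String (assumes UTF-8). */
    static void setFileContent(File textFile, String content) {
        try {
            // With no options given, Files.write truncates an existing file,
            // so any previous text is overwritten.
            Files.write(textFile.toPath(), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```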