java.lang.Object
  hultig.sumo.CorpusIndex
public class CorpusIndex
Represents a lexical index for corpora, associating a unique number (the index) with each word. The main goal of this class is to make text processing more efficient (faster).
The text corpus may be added incrementally, file by file; the index is then redefined by invoking the rebuild() method. The dictionary is reset through the clearHash() method.
University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)
Field Summary | |
---|---|
Hashtable<String,Integer> | hstab - A hash table for counting word frequencies in corpora. |
TreeMap<Integer,String> | idict - A corpora index with the numeric index being the key. |
static int | NO_TRUNC - The code for disabling word truncation, see TRUNCV. |
TreeMap<String,Integer> | sdict - A corpora index with the words/tokens being the keys. |
int | TRUNCV - Size of word truncation. |
Constructor Summary | |
---|---|
CorpusIndex() | The default constructor initializes the main properties and components of the class, also calling the clearHash() method. |
CorpusIndex(int truncv) | Performs the main initializations of this class, also calling the clearHash() method, and sets the word truncation value. |
Method Summary | |
---|---|
void | add(Sentence stc) - Incrementally adds the words of a given Sentence to this corpora index. |
void | add(Sentence[] vs) - Adds the words contained in an array of sentences to this corpus index. |
void | add(String str) - Adds the words contained in a given string to this corpus index. |
void | add(String[] vs) - Adds the words contained in an array of strings to this corpus index. |
void | addText(Text txt) - Incrementally adds the words of a given Text to this corpora index. |
void | clearHash() - Recreates the current index main table hstab. |
void | codeFile(String infile, String outfile) - Codifies a file according to the loaded dictionary. |
void | codify(Sentence[] vs) - Codifies any "Word" contained in an array of Sentences, according to this dictionary. |
static void | codifyOnFly(ChunkedSentence[] sentences) - Codification "on the fly" for an array of chunked sentences. |
static CorpusIndex | codifyOnFly(Sentence... sentences) - Codification "on the fly" for a given array of sentences. |
static void | demoForWeb() |
int | freq(String token) - Gives the token frequency. |
String | get(int key) - Gets the token for a given code key. |
String | get(int[] vkeys) - Given an array of codes expected to represent a word sequence, such as a sentence, returns the corresponding string form. |
int | get(String token) - Gets the code for a given token. |
String | getEncoding() - Gives the current encoding string, used to read corpora files. |
void | load(CorpusIndex d) - Redefines this corpus index, based on an already existing one. |
boolean | load(String fname) - Loads a given corpora index from a binary file, previously saved by an instance of this class through the method save(String). |
boolean | loadASCIIDictionary(String filename) - Loads a corpus index table from a text file. |
static void | main(String[] args) - This "main" method enables the command line execution of this class in order to create a given corpus dictionary. |
boolean | printDict(PrintStream out) - Prints the corpus index to a text file. |
boolean | printDict(String fout) - Prints the corpus index to a text file (see printDict(PrintStream)). |
static void | printHelp() - Prints the set of arguments that can be passed through the command line (main). |
boolean | readCorpus(String filename) - Reads a corpus text file, recreating the index. |
boolean | readCorpus(String filename, boolean adding) - Reads a corpus text file, incrementally adding its new "unseen" words to this object. |
boolean | readCorpus(Vector<String> vtokens) - Recreates the index from a list of string tokens, presumably words. |
void | rebuild() - Recreates the corpus index from the text loaded so far. |
boolean | save(String fname) - Saves this object to a binary file. |
void | setEncoding(String encode) - Defines a new encoding for reading corpora text files. |
static Vector<String> | splitWords(String s) - Splits a given string sentence into a list of words. |
int | sum() - Sums the frequencies for all tokens. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
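public TreeMap<String,Integer> sdict
A corpora index with the words/tokens being the keys.
public TreeMap<Integer,String> idict
A corpora index with the numeric index being the key.
public Hashtable<String,Integer> hstab
A hash table for counting word frequencies in corpora.
public int TRUNCV
Size of word truncation. Words are truncated to a TRUNCV maximum length.
public static int NO_TRUNC
The code for disabling word truncation, see TRUNCV.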
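To illustrate how a truncation size such as TRUNCV might be applied, here is a minimal sketch. It assumes that truncation keeps the first TRUNCV characters of a word and that a negative value disables it. The class name TruncationSketch, the helper truncate, and the sentinel value -1 are hypothetical: the actual NO_TRUNC value and truncation rule of CorpusIndex are not documented on this page.

```java
final class TruncationSketch {
    // Hypothetical sentinel meaning "do not truncate"; the real NO_TRUNC value is not documented.
    static final int NO_TRUNC = -1;

    // Returns the word cut to at most truncv characters,
    // or the word unchanged when truncv is negative (truncation disabled).
    static String truncate(String word, int truncv) {
        if (truncv < 0 || word.length() <= truncv) {
            return word;
        }
        return word.substring(0, truncv);
    }

    public static void main(String[] args) {
        System.out.println(truncate("economically", 7));        // economi
        System.out.println(truncate("economically", NO_TRUNC)); // economically
    }
}
```

Under this assumption, inflected forms sharing a long prefix would collapse into a single index entry, which is one plausible reason for offering truncation at all.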
Constructor Detail |
---|
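public CorpusIndex()
The default constructor initializes the main properties and components of the class, also calling the clearHash() method. The default encoding is UTF-8.
public CorpusIndex(int truncv)
Performs the main initializations of this class, also calling the clearHash() method, and sets the word truncation value.
Parameters:
truncv - The word truncation value (see TRUNCV).
Method Detail |
---|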
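public final void clearHash()
Recreates the current index main table hstab.
public static Vector<String> splitWords(String s)
Splits a given string sentence into a list of words.
Parameters:
s - The string sentence.
Returns:
The list of words obtained from s.
public boolean readCorpus(Vector<String> vtokens)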
Recreates the index from a list of string tokens, presumably words.
Parameters:
vtokens - The list of string tokens.
Returns:
true on success, and false if an error occurs.
public boolean readCorpus(String filename)
Reads a corpus text file, recreating the index. Equivalent to readCorpus(filename, false).
Parameters:
filename - The file name from which text will be read.
Returns:
true on success, and false if an error occurs.
public boolean readCorpus(String filename, boolean adding)
Reads a corpus text file, incrementally adding its new "unseen" words to this object. Previously read data is cleaned when adding = false.
Parameters:
filename - The file name from which the corpus is read.
adding - A flag that determines whether previously read corpora data should be maintained or cleaned.
Returns:
true on success, and false if an error occurs.
public void addText(Text txt)
Incrementally adds the words of a given Text to this corpora index. This method should be combined with the methods clearHash() and rebuild(), as exemplified below:
    CorpusIndex dic = new CorpusIndex();
    dic.clearHash();
    dic.addText(txt1);
    dic.addText(txt2);
    dic.addText(txt3);
    dic.rebuild();
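The clear, add, rebuild cycle can be pictured with a small standalone sketch. It is not the CorpusIndex implementation: the class MiniIndex, the whitespace tokenization, and the rule that rebuild() numbers words in alphabetical order starting from 0 are all assumptions made for illustration. Only the field names hstab, sdict and idict, and their types, mirror the fields documented for this class.

```java
import java.util.Hashtable;
import java.util.TreeMap;

final class MiniIndex {
    final Hashtable<String, Integer> hstab = new Hashtable<>(); // word -> frequency
    final TreeMap<String, Integer> sdict = new TreeMap<>();     // word -> code
    final TreeMap<Integer, String> idict = new TreeMap<>();     // code -> word

    // Empties the frequency table (assumed analogue of clearHash()).
    void clearHash() {
        hstab.clear();
    }

    // Counts each whitespace-separated token; the real tokenizer may differ.
    void add(String text) {
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                hstab.merge(w, 1, Integer::sum);
            }
        }
    }

    // Reassigns codes from the counts gathered so far
    // (assumption: alphabetical order, codes starting at 0).
    void rebuild() {
        sdict.clear();
        idict.clear();
        int code = 0;
        for (String w : new TreeMap<String, Integer>(hstab).keySet()) {
            sdict.put(w, code);
            idict.put(code, w);
            code++;
        }
    }

    public static void main(String[] args) {
        MiniIndex dic = new MiniIndex();
        dic.clearHash();
        dic.add("the cat sat");
        dic.add("the dog sat");
        dic.rebuild();
        System.out.println(dic.sdict);            // {cat=0, dog=1, sat=2, the=3}
        System.out.println(dic.hstab.get("the")); // 2
    }
}
```

Deferring code assignment to rebuild() means that adding text only touches the frequency table, so several texts can be added cheaply before the two lookup maps are regenerated once.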
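Parameters:
txt - The text to be added to this index.
public void add(Sentence stc)
Incrementally adds the words of a given Sentence to this corpora index. Operates similarly to addText(Text).
Parameters:
stc - The sentence to be added to this index.
public void add(String str)
Adds the words contained in a given string to this corpus index.
Parameters:
str - The input string.
public void add(String[] vs)
Adds the words contained in an array of strings to this corpus index. See the add(String) method.
Parameters:
vs - The array of strings to be processed and integrated.
public void add(Sentence[] vs)
Adds the words contained in an array of sentences to this corpus index. See the add(Sentence) method.
Parameters:
vs - The array of sentences from which to add the words.
public void rebuild()
Recreates the corpus index from the text loaded so far.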
public boolean printDict(String fout)
Prints the corpus index to a text file (see printDict(PrintStream)).
Parameters:
fout - The file name into which the corpus index is going to be printed.
Returns:
true on success, and false if an error occurs.
public boolean printDict(PrintStream out)
Prints the corpus index to a text file. Each entry follows the format "KEY WORD FREQ", for example: 10045 economy 2795.
Parameters:
out - The file stream into which the corpus index is going to be printed.
Returns:
true on success, and false if an error occurs.
public boolean loadASCIIDictionary(String filename)
Loads a corpus index table from a text file. Entries are expected in the format KEY TOKEN FREQ, similarly to the scheme and example shown in method printDict(PrintStream), which is the symmetric counterpart of this method.
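As a rough illustration of the KEY TOKEN FREQ scheme, the sketch below formats and parses a single entry. It assumes one entry per line with whitespace-separated fields, which this page does not state explicitly. The class DictLine and its methods are hypothetical helpers, not part of CorpusIndex.

```java
final class DictLine {
    final int key;
    final String token;
    final int freq;

    DictLine(int key, String token, int freq) {
        this.key = key;
        this.token = token;
        this.freq = freq;
    }

    // Formats one entry as "KEY WORD FREQ" (single-space separators are an assumption).
    String format() {
        return key + " " + token + " " + freq;
    }

    // Parses one "KEY TOKEN FREQ" line; returns null when the line
    // does not contain exactly three fields with numeric key and frequency.
    static DictLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            return null;
        }
        try {
            return new DictLine(Integer.parseInt(parts[0]), parts[1], Integer.parseInt(parts[2]));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        DictLine entry = parse("10045 economy 2795");
        System.out.println(entry.token + " occurs " + entry.freq + " times"); // economy occurs 2795 times
        System.out.println(entry.format());                                   // 10045 economy 2795
    }
}
```

Because format() and parse() are inverses for well-formed entries, a file written one way can be read back the other, which is the symmetry the two documented methods describe.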
Parameters:
filename - The file name from which to load the table.
Returns:
true on success, and false if an error occurs.
public void load(CorpusIndex d)
Redefines this corpus index, based on an already existing one.
Parameters:
d - The new index that redefines this object.
public boolean load(String fname)
Loads a given corpora index from a binary file, previously saved by an instance of this class through the method save(String).
Parameters:
fname - The file name from which to read.
Returns:
true on success, and false if an error occurs.
public boolean save(String fname)
Saves this object to a binary file.
Returns:
true on success, and false if an error occurs.
public void codeFile(String infile, String outfile)
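Codifies a file according to the loaded dictionary.
Parameters:
infile - The file to be codified.
outfile - The generated codified file.
public String get(int key)
Gets the token for a given code key.
public String get(int[] vkeys)
Given an array of codes expected to represent a word sequence, such as a sentence, returns the corresponding string form.
Parameters:
vkeys - The array of word keys.
public int get(String token)
Gets the code for a given token.
Parameters:
token - The token string.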
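The three get overloads amount to lookups in a pair of mirrored maps. The sketch below shows one way such lookups could work. It is an illustration only: the class CodecSketch, the put helper, the -1 result for unknown tokens, the skipping of unknown codes, and the single-space join are all assumptions and are not documented behaviour of CorpusIndex.

```java
import java.util.TreeMap;

final class CodecSketch {
    final TreeMap<String, Integer> sdict = new TreeMap<>(); // token -> code
    final TreeMap<Integer, String> idict = new TreeMap<>(); // code -> token

    // Registers a token under a code in both directions (hypothetical helper).
    void put(String token, int code) {
        sdict.put(token, code);
        idict.put(code, token);
    }

    // Code for a token; returning -1 for unknown tokens is an assumption of this sketch.
    int get(String token) {
        Integer code = sdict.get(token);
        return code == null ? -1 : code;
    }

    // Token for a code, or null when the code is not in the index.
    String get(int key) {
        return idict.get(key);
    }

    // Decodes a code sequence into a space-separated string, skipping unknown codes.
    String get(int[] vkeys) {
        StringBuilder sb = new StringBuilder();
        for (int k : vkeys) {
            String token = get(k);
            if (token == null) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CodecSketch dic = new CodecSketch();
        dic.put("cat", 0);
        dic.put("sat", 2);
        dic.put("the", 3);
        System.out.println(dic.get("sat"));                // 2
        System.out.println(dic.get(new int[] {3, 0, 2}));  // the cat sat
    }
}
```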
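public int freq(String token)
Gives the token frequency.
Parameters:
token - The token string.
public int sum()
Sums the frequencies for all tokens.
public void codify(Sentence[] vs)
Codifies any "Word" contained in an array of Sentences, according to this dictionary.
Parameters:
vs - The array of sentences to be codified.
public static CorpusIndex codifyOnFly(Sentence... sentences)
Codification "on the fly" for a given array of sentences.
Parameters:
sentences - The array of sentences to be codified.
public static void codifyOnFly(ChunkedSentence[] sentences)
Codification "on the fly" for an array of chunked sentences. See the codifyOnFly(Sentence[] sentences) method.
Parameters:
sentences - The array of chunked sentences to be codified.
public void setEncoding(String encode)
Defines a new encoding for reading corpora text files.
Parameters:
encode - The encoding string, for example: UTF-8, or ISO-8859-1.
public String getEncoding()
Gives the current encoding string, used to read corpora files.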
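To show why the encoding string matters when reading corpora files, here is a small sketch that decodes a text file using a charset name such as UTF-8 or ISO-8859-1. It relies only on standard java.nio calls. The class EncodingSketch and the method readLines are hypothetical, and nothing here is taken from how CorpusIndex reads files internally.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class EncodingSketch {
    // Reads all lines of a file, decoding its bytes with the named charset.
    // Charset.forName throws an unchecked exception for illegal or unsupported names.
    static List<String> readLines(String filename, String encoding) throws IOException {
        Charset cs = Charset.forName(encoding);
        return Files.readAllLines(Paths.get(filename), cs);
    }

    public static void main(String[] args) throws IOException {
        // Write a Portuguese word as ISO-8859-1 bytes, then read it back with the same encoding.
        Path tmp = Files.createTempFile("corpus", ".txt");
        Files.write(tmp, "cora\u00e7\u00e3o\n".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readLines(tmp.toString(), "ISO-8859-1"));
        Files.delete(tmp);
    }
}
```

Reading the same bytes with a mismatched charset name would either fail or produce different characters, so the encoding should match the one used to write the corpus files.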
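public static void printHelp()
Prints the set of arguments that can be passed through the command line (main).
public static void demoForWeb()
public static void main(String[] args)
This "main" method enables the command line execution of this class in order to create a given corpus dictionary.
Parameters:
args - Should comply with the syntax defined in the printHelp() method.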