hultig.sumo
Class CorpusIndex

java.lang.Object
  extended by hultig.sumo.CorpusIndex
All Implemented Interfaces:
Serializable

public class CorpusIndex
extends Object
implements Serializable

Represents a lexical index for corpora by associating a unique number, the index, with each word. The main goal of this class is to make text processing more efficient (faster).

The text corpus may be incrementally added, file by file, and the redefinition of the index is executed by invoking the rebuild() method. The dictionary is reset through the clearHash() method.
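The idea described above can be sketched without the library. The following is a minimal standalone illustration, not the actual implementation of this class: it counts word frequencies in a hash table as text is added, then assigns numeric indexes when rebuilt. The class name IndexSketch, the whitespace tokenization, and the choice of numbering words alphabetically starting at 1 are assumptions made only for this example.

```java
import java.util.Hashtable;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the concept, not hultig.sumo.CorpusIndex itself.
class IndexSketch {
    final Hashtable<String, Integer> freq = new Hashtable<String, Integer>();  // word -> frequency
    final TreeMap<String, Integer> byWord = new TreeMap<String, Integer>();    // word -> index
    final TreeMap<Integer, String> byIndex = new TreeMap<Integer, String>();   // index -> word

    // Incrementally counts the whitespace-separated words of a string.
    void add(String text) {
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                freq.merge(w, 1, Integer::sum);
            }
        }
    }

    // Recomputes the numeric indexes from everything added so far.
    // Assumption: words are numbered alphabetically, starting at 1.
    void rebuild() {
        byWord.clear();
        byIndex.clear();
        int next = 1;
        for (String w : new TreeSet<String>(freq.keySet())) {
            byWord.put(w, next);
            byIndex.put(next, w);
            next++;
        }
    }
}
```

Keeping both a word-to-index and an index-to-word map mirrors the sdict and idict fields documented below, so lookups are direct in either direction.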

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:
Serialized Form

Field Summary
 Hashtable<String,Integer> hstab
          A hash table for counting word frequencies in corpora.
 TreeMap<Integer,String> idict
          A corpora index with the numeric index being the key.
static int NO_TRUNC
          The code for disabling word truncation (see TRUNCV).
 TreeMap<String,Integer> sdict
          A corpora index with the words/tokens being the keys.
 int TRUNCV
          Size of word truncation.
 
Constructor Summary
CorpusIndex()
          The default constructor initializes the main properties and components of the class, and also calls the clearHash() method.
CorpusIndex(int truncv)
          Performs the main initializations of this class, also calling the clearHash() method, and sets the word truncation value.
 
Method Summary
 void add(Sentence stc)
          Incrementally adds the words of a given Sentence to this corpora index.
 void add(Sentence[] vs)
          Adds the words contained in an array of sentences to this corpus index.
 void add(String str)
          Adds the words contained in a given string to this corpus index.
 void add(String[] vs)
          Adds the words contained in an array of strings to this corpus index.
 void addText(Text txt)
          Incrementally adds the words of a given Text to this corpora index.
 void clearHash()
          Recreates the current index main table hstab.
 void codeFile(String infile, String outfile)
          Codifies a file according to the loaded dictionary.
 void codify(Sentence[] vs)
          Codifies any "Word" contained in an array of Sentences, according to this dictionary.
static void codifyOnFly(ChunkedSentence[] sentences)
          Codification "on the fly" for an array of chunked sentences.
static CorpusIndex codifyOnFly(Sentence... sentences)
          Codification "on the fly" for a given array of sentences.
static void demoForWeb()
           
 int freq(String token)
          Gives the token frequency.
 String get(int key)
          Get the token from a given code key.
 String get(int[] vkeys)
          Given an array of codes expected to represent a word sequence, such as a sentence, returns the corresponding string form.
 int get(String token)
          Get the code from a given token.
 String getEncoding()
          Gives the current encoding string, used to read corpora files.
 void load(CorpusIndex d)
          Redefines this corpus index, based on an already existing one.
 boolean load(String fname)
          Loads a given corpora index from a binary file, previously saved by an instance of this class through the method save(String).
 boolean loadASCIIDictionary(String filename)
          Loads a corpus index table from a text file.
static void main(String[] args)
          This "main" method enables the command line execution of this class in order to create a given corpus dictionary.
 boolean printDict(PrintStream out)
          Prints the corpus index in a text file.
 boolean printDict(String fout)
          Prints the corpus index in a text file (see printDict(PrintStream)).
static void printHelp()
          Prints the set of arguments that can be passed through the command line (main).
 boolean readCorpus(String filename)
          Reads a corpus text file, recreating the index.
 boolean readCorpus(String filename, boolean adding)
          Reads a corpus text file, incrementally adding its new "unseen" words to this object.
 boolean readCorpus(Vector<String> vtokens)
          Recreates the index from a list of string tokens, presumably words.
 void rebuild()
          Recreates the corpus index upon the text loaded so far.
 boolean save(String fname)
          Saves this object to a binary file.
 void setEncoding(String encode)
          Defines a new encoding for reading corpora text files.
static Vector<String> splitWords(String s)
          Splits a given string sentence into a list of words.
 int sum()
          Sums the frequencies for all tokens.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

sdict

public TreeMap<String,Integer> sdict
A corpora index with the words/tokens being the keys. Given a word we can obtain its numeric index.


idict

public TreeMap<Integer,String> idict
A corpora index with the numeric index being the key. Given a numeric index we can get the corresponding word.


hstab

public Hashtable<String,Integer> hstab
A hash table for counting word frequencies in corpora.


TRUNCV

public int TRUNCV
Size of word truncation. If this value is greater than zero, tokens read from the corpora are truncated and stored with at most TRUNCV characters.
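The truncation rule can be illustrated with a small standalone helper. This is only a sketch: the name TruncSketch is hypothetical, it assumes that truncation keeps the leading characters of a token, and it treats any non-positive value as "no truncation" because this page does not state the actual value of NO_TRUNC.

```java
// Hypothetical helper illustrating the rule described for TRUNCV.
class TruncSketch {
    // Returns the token cut to at most truncv characters when truncv > 0,
    // and the token unchanged otherwise (truncation disabled).
    // Assumption: the kept part is the prefix of the token.
    static String truncate(String token, int truncv) {
        if (truncv > 0 && token.length() > truncv) {
            return token.substring(0, truncv);
        }
        return token;
    }
}
```

Under these assumptions, truncating with a small value makes inflected forms that share a prefix collapse into a single index entry.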


NO_TRUNC

public static int NO_TRUNC
The code for disabling word truncation (see TRUNCV).

Constructor Detail

CorpusIndex

public CorpusIndex()
The default constructor initializes the main properties and components of the class, and also calls the clearHash() method. The default encoding is UTF-8.


CorpusIndex

public CorpusIndex(int truncv)
Performs the main initializations of this class, also calling the clearHash() method, and sets the word truncation value.

Parameters:
truncv - The word truncation value (see TRUNCV).
Method Detail

clearHash

public final void clearHash()
Recreates the current index main table hstab.


splitWords

public static Vector<String> splitWords(String s)
Splits a given string sentence into a list of words.

Parameters:
s - The string sentence.
Returns:
The list of words/tokens found in s.

readCorpus

public boolean readCorpus(Vector<String> vtokens)
Recreates the index from a list of string tokens, presumably words.

Parameters:
vtokens - The list of string tokens.
Returns:
The true value on success, and false if some erroneous situation occurs.

readCorpus

public boolean readCorpus(String filename)
Reads a corpus text file, recreating the index. This method calls readCorpus(filename, false).

Parameters:
filename - The file name from which text will be read.
Returns:
The true value on success, and false if some erroneous situation occurs.

readCorpus

public boolean readCorpus(String filename,
                          boolean adding)
Reads a corpus text file, incrementally adding its new "unseen" words to this object. The index is recreated only if adding = false.

Parameters:
filename - The file name from which the corpus is read.
adding - A flag that determines whether previously read corpora data should be maintained, or cleaned.
Returns:
The true value on success, and false if some erroneous situation occurs.

addText

public void addText(Text txt)
Incrementally adds the words of a given Text to this corpora index. This method should be used in combination with the methods clearHash() and rebuild(), as in the example below:
    CorpusIndex dic= new CorpusIndex();
    dic.clearHash();
    dic.addText(txt1);
    dic.addText(txt2);
    dic.addText(txt3);
    dic.rebuild();
 

Parameters:
txt - The text to be added to this index.

add

public void add(Sentence stc)
Incrementally adds the words of a given Sentence to this corpora index. Operates similarly to addText(Text).

Parameters:
stc - The sentence to be added to this index.

add

public void add(String str)
Adds the words contained in a given string to this corpus index.

Parameters:
str - The input string.

add

public void add(String[] vs)
Adds the words contained in an array of strings to this corpus index. This method invokes the add(String) method.

Parameters:
vs - The array of strings to be processed and integrated.

add

public void add(Sentence[] vs)
Adds the words contained in an array of sentences to this corpus index. This method invokes the add(Sentence) method.

Parameters:
vs - The array of sentences from which to add the words.

rebuild

public void rebuild()
Recreates the corpus index upon the text loaded so far. The numeric indexes are recomputed.


printDict

public boolean printDict(String fout)
Prints the corpus index in a text file (see printDict(PrintStream)).

Parameters:
fout - The file name into which the corpus index is going to be printed.
Returns:
The true value on success, and false if some erroneous situation occurs.

printDict

public boolean printDict(PrintStream out)
Prints the corpus index in a text file. Each word is printed with its numeric index and its frequency, one word per line, in the format "KEY WORD FREQ", for example: 10045 economy 2795.

Parameters:
out - The file stream into which the corpus index is going to be printed.
Returns:
The true value on success, and false if some erroneous situation occurs.
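The "KEY WORD FREQ" layout can be reproduced with a few lines of standalone Java. The sketch below is not the library's code: the class PrintDictSketch and its methods are hypothetical, and it assumes that entries are written in ascending key order with single spaces between the fields.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the "KEY WORD FREQ" line format.
class PrintDictSketch {
    // Writes one "KEY WORD FREQ" line per entry to the given stream.
    // Assumption: entries are listed in ascending key order.
    static void print(PrintStream out, TreeMap<Integer, String> byIndex,
                      Map<String, Integer> freq) {
        for (Map.Entry<Integer, String> e : byIndex.entrySet()) {
            String word = e.getValue();
            out.println(e.getKey() + " " + word + " " + freq.getOrDefault(word, 0));
        }
    }

    // Convenience for the example: renders the listing into a string.
    static String render(TreeMap<Integer, String> byIndex, Map<String, Integer> freq) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf);
        print(out, byIndex, freq);
        out.flush();
        return buf.toString();
    }
}
```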

loadASCIIDictionary

public boolean loadASCIIDictionary(String filename)
Loads a corpus index table from a text file. The expected format is a 3-tuple per line, KEY TOKEN FREQ, following the scheme and example shown in the method printDict(PrintStream), which is the counterpart of this method.

Parameters:
filename - The file name from which to load the table.
Returns:
The true value on success, and false if some erroneous situation occurs.
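Parsing one line of the KEY TOKEN FREQ format might look like the following standalone sketch. The class DictLineSketch is hypothetical, and it assumes that the three fields are separated by whitespace and that a malformed line should be rejected with an exception; how the real method handles bad input is not described on this page.

```java
// Hypothetical parser for a single "KEY TOKEN FREQ" line.
class DictLineSketch {
    final int key;
    final String token;
    final int freq;

    DictLineSketch(int key, String token, int freq) {
        this.key = key;
        this.token = token;
        this.freq = freq;
    }

    // Splits the line on whitespace and converts the numeric fields.
    // Assumption: exactly three fields per line; anything else is an error.
    static DictLineSketch parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected KEY TOKEN FREQ: " + line);
        }
        return new DictLineSketch(Integer.parseInt(parts[0]), parts[1],
                                  Integer.parseInt(parts[2]));
    }
}
```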

load

public void load(CorpusIndex d)
Redefines this corpus index, based on an already existing one.

Parameters:
d - The new index that redefines this object.

load

public boolean load(String fname)
Loads a given corpora index from a binary file, previously saved by an instance of this class through the method save(String).

Parameters:
fname - The file name from which to read.
Returns:
The true value on success, and false if some erroneous situation occurs.

save

public boolean save(String fname)
Saves this object to a binary file.

Returns:
The true value on success, and false if some erroneous situation occurs.
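Since this class implements Serializable, the binary format used by save(String) and load(String) may be standard Java object serialization, although this page does not confirm it. Under that assumption, a round trip can be sketched as follows, using an in-memory buffer instead of a file; the class SerializeSketch and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical round trip through Java object serialization.
class SerializeSketch {
    // Serializes any Serializable object into a byte array.
    static byte[] toBytes(Serializable obj) {
        try (ByteArrayOutputStream buf = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Restores an object from bytes produced by toBytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```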

codeFile

public void codeFile(String infile,
                     String outfile)
Codifies a file according to the loaded dictionary.

Parameters:
infile - The file to be codified.
outfile - The generated codified file.

get

public String get(int key)
Gets the token for a given code key. If the key does not belong to this dictionary, the null value is returned.


get

public String get(int[] vkeys)
Given an array of codes expected to represent a word sequence, such as a sentence, returns the corresponding string form.

Parameters:
vkeys - The array of word keys.
Returns:
The string form of the word sequence represented by vkeys.
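Decoding an array of codes back into text can be sketched with a plain index-to-word map. This is a standalone illustration, not the library's implementation: the class DecodeSketch is hypothetical, and it assumes that tokens are joined with single spaces and that unknown keys are skipped.

```java
import java.util.TreeMap;

// Hypothetical decoder from numeric codes to a string.
class DecodeSketch {
    // Looks up each key and joins the words found.
    // Assumptions: single-space separator; keys without an entry are skipped.
    static String decode(TreeMap<Integer, String> byIndex, int[] keys) {
        StringBuilder sb = new StringBuilder();
        for (int k : keys) {
            String word = byIndex.get(k);
            if (word == null) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        return sb.toString();
    }
}
```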

get

public int get(String token)
Gets the code for a given token.

Parameters:
token - The token string.
Returns:
The code or -1 if something is wrong.

freq

public int freq(String token)
Gives the token frequency.

Parameters:
token - The token string.
Returns:
The frequency of the token, or -1 when the token is not found in this dictionary.

sum

public int sum()
Sums the frequencies for all tokens.

Returns:
The sum or else -1 meaning that the dictionary hashtable is not defined.
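The -1 sentinel returned by get(String), freq(String) and sum() can be illustrated with standalone map lookups. The class LookupSketch below is hypothetical and works on plain maps rather than on a CorpusIndex; it only shows the return convention described above.

```java
import java.util.Map;

// Hypothetical lookups that follow the -1 convention described above.
class LookupSketch {
    // Code of a token, or -1 when the token is unknown.
    static int code(Map<String, Integer> byWord, String token) {
        return byWord.getOrDefault(token, -1);
    }

    // Frequency of a token, or -1 when the token is unknown.
    static int freq(Map<String, Integer> freqTable, String token) {
        return freqTable.getOrDefault(token, -1);
    }

    // Sum of all frequencies, or -1 when the table is not defined (null).
    static int sum(Map<String, Integer> freqTable) {
        if (freqTable == null) {
            return -1;
        }
        int total = 0;
        for (int f : freqTable.values()) {
            total += f;
        }
        return total;
    }
}
```

Callers should check for -1 before using a result as an index or a count, since it is a marker and not a valid value.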

codify

public void codify(Sentence[] vs)
Codifies any "Word" contained in an array of Sentences, according to this dictionary. By "codifying" here we mean that each word gets its dictionary index.

Parameters:
vs - The array of sentences to be codified.

codifyOnFly

public static CorpusIndex codifyOnFly(Sentence... sentences)
Codification "on the fly" for a given array of sentences. It means that the dictionary is automatically created for the received array of sentences and their words are codified accordingly.

Parameters:
sentences - The array of sentences to be codified.
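The "on the fly" idea, building a dictionary from the input and immediately encoding that same input, can be sketched on plain strings. This standalone example does not use the Sentence class; OnFlySketch is a hypothetical name, and the whitespace tokenization, alphabetical numbering from 1, and -1 for unknown words are all assumptions.

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical "build then encode" illustration on plain strings.
class OnFlySketch {
    // Builds a word -> code map from the given sentences.
    // Assumption: codes follow alphabetical order, starting at 1.
    static TreeMap<String, Integer> buildIndex(String... sentences) {
        TreeSet<String> words = new TreeSet<String>();
        for (String s : sentences) {
            for (String w : s.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        TreeMap<String, Integer> index = new TreeMap<String, Integer>();
        int next = 1;
        for (String w : words) {
            index.put(w, next++);
        }
        return index;
    }

    // Replaces each word of a sentence by its code (-1 if unknown).
    static int[] encode(TreeMap<String, Integer> index, String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] ws = trimmed.split("\\s+");
        int[] codes = new int[ws.length];
        for (int i = 0; i < ws.length; i++) {
            codes[i] = index.getOrDefault(ws[i], -1);
        }
        return codes;
    }
}
```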

codifyOnFly

public static void codifyOnFly(ChunkedSentence[] sentences)
Codification "on the fly" for an array of chunked sentences. This method is similar to the codifyOnFly(Sentence... sentences) method.

Parameters:
sentences - The array of chunked sentences to be codified.

setEncoding

public void setEncoding(String encode)
Defines a new encoding for reading corpora text files.

Parameters:
encode - The encoding string, for example: UTF-8, or ISO-8859-1.
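The effect of choosing an encoding when reading corpus files can be shown with the standard library alone. The sketch below is not how this class necessarily reads files; EncodingSketch and its methods are hypothetical, and the example only demonstrates that text must be read back with the same charset name it was written with.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration of reading text with a named encoding.
class EncodingSketch {
    // Reads all lines of a file using the charset with the given name,
    // for example "UTF-8" or "ISO-8859-1".
    static List<String> readLines(Path file, String encoding) {
        try {
            return Files.readAllLines(file, Charset.forName(encoding));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the example: writes text to a temporary file in the given encoding.
    static Path writeTemp(String text, String encoding) {
        try {
            Path p = Files.createTempFile("corpus", ".txt");
            Files.write(p, text.getBytes(Charset.forName(encoding)));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```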

getEncoding

public String getEncoding()
Gives the current encoding string, used to read corpora files.

Returns:
The encoding string.

printHelp

public static void printHelp()
Prints the set of arguments that can be passed through the command line (main).


demoForWeb

public static void demoForWeb()

main

public static void main(String[] args)
This "main" method enables the command line execution of this class in order to create a given corpus dictionary.

Parameters:
args - Should comply with the syntax defined in the printHelp() method.