CorpusIndex

hultig.sumo
Class CorpusIndex

java.lang.Object
  hultig.sumo.CorpusIndex

All Implemented Interfaces:: Serializable

public class CorpusIndex
extends Object
implements Serializable

Represents a lexical index of a corpus by associating a unique number, the index, with each word. The main goal of this class is to make text processing more efficient (faster).

The text corpus may be added incrementally, file by file; the index is then redefined by invoking the rebuild() method. The dictionary is reset through the clearHash() method.
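
The idea described above (count words, then assign each distinct word a number) can be sketched with standard Java collections. This is a minimal illustration, not the source of hultig.sumo.CorpusIndex: the class name WordIndexSketch is hypothetical, and the whitespace tokenization, the alphabetical numbering and the starting code 0 are assumptions. Only the field names hstab, sdict and idict are taken from this page.

```java
import java.util.Hashtable;
import java.util.TreeMap;

// Hypothetical stand-in for illustration only; not the CorpusIndex implementation.
class WordIndexSketch {
    final Hashtable<String, Integer> hstab = new Hashtable<>(); // token -> frequency
    final TreeMap<String, Integer> sdict = new TreeMap<>();     // token -> code
    final TreeMap<Integer, String> idict = new TreeMap<>();     // code -> token

    // Counts each whitespace-separated token (assumed tokenization).
    void add(String text) {
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) hstab.merge(tok, 1, Integer::sum);
        }
    }

    // Recomputes the codes from the counted tokens:
    // alphabetical order, starting at 0 (both are assumptions).
    void rebuild() {
        sdict.clear();
        idict.clear();
        int code = 0;
        for (String tok : new TreeMap<String, Integer>(hstab).keySet()) {
            sdict.put(tok, code);
            idict.put(code, tok);
            code++;
        }
    }

    public static void main(String[] args) {
        WordIndexSketch dic = new WordIndexSketch();
        dic.add("the cat saw the dog");
        dic.add("the dog ran");
        dic.rebuild();
        System.out.println(dic.sdict);            // token -> code
        System.out.println(dic.hstab.get("the")); // frequency of "the"
    }
}
```

Once the codes exist, a sentence can be handled as an array of integers, which is cheaper to compare and hash than an array of strings.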

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:: Serialized Form

Field Summary
`Hashtable<String,Integer>`	`hstab` A hash table for counting word frequencies in corpora.
`TreeMap<Integer,String>`	`idict` A corpora index with the numeric index being the key.
`static int`	`NO_TRUNC` The code for disabling word truncation, see `TRUNCV`
`TreeMap<String,Integer>`	`sdict` A corpora index with the words/tokens being the keys.
`int`	`TRUNCV` Size of word truncation.

Constructor Summary
`CorpusIndex()` The default constructor initializes the class main properties and components, by also calling the `clearHash()` method.
`CorpusIndex(int truncv)` Provides the main initializations on this class, by also calling the `clearHash()` method, and sets the word truncation value.

Method Summary
`void`	`add(Sentence stc)` Incrementally adds the words of a given `Sentence` to this corpora index.
`void`	`add(Sentence[] vs)` Adds the words contained in an array of sentences to this corpus index.
`void`	`add(String str)` Adds the words contained in a given string to this corpus index.
`void`	`add(String[] vs)` Adds the words contained in an array of strings to this corpus index.
`void`	`addText(Text txt)` Incrementally adds the words of a given `Text` to this corpora index.
`void`	`clearHash()` Recreates the current index main table `hstab`.
`void`	`codeFile(String infile, String outfile)` Codifies a file according to the loaded dictionary.
`void`	`codify(Sentence[] vs)` Codifies any "Word" contained in an array of Sentences, according to this dictionary.
`static void`	`codifyOnFly(ChunkedSentence[] sentences)` Codification "on the fly" for an array of chunked sentences.
`static CorpusIndex`	`codifyOnFly(Sentence... sentences)` Codification "on the fly" for a given array of sentences.
`static void`	`demoForWeb()`
`int`	`freq(String token)` Gives the token frequency.
`String`	`get(int key)` Get the token from a given code key.
`String`	`get(int[] vkeys)` Given an array of codes expected to represent a word sequence, for example a sentence, returns the corresponding string form.
`int`	`get(String token)` Get the code from a given token.
`String`	`getEncoding()` Gives the current encoding string, used to read corpora files.
`void`	`load(CorpusIndex d)` Redefines this corpus index, based on an already existing one.
`boolean`	`load(String fname)` Loads a given corpus index from a binary file, previously saved by an instance of this class through the method `save(String)`.
`boolean`	`loadASCIIDictionary(String filename)` Loads a corpus index table from a text file.
`static void`	`main(String[] args)` This "main" method enables the command line execution of this class in order to create a given corpus dictionary.
`boolean`	`printDict(PrintStream out)` Prints the corpus index in a text file.
`boolean`	`printDict(String fout)` Prints the corpus index in a text file (see `printDict(PrintStream)`).
`static void`	`printHelp()` Prints the set of arguments that can be passed through the command line (`main`).
`boolean`	`readCorpus(String filename)` Reads a corpus text file, recreating the index.
`boolean`	`readCorpus(String filename, boolean adding)` Reads a corpus text file, incrementally adding its new "unseen" words to this object.
`boolean`	`readCorpus(Vector<String> vtokens)` Recreates the index from a list of string tokens, presumably words.
`void`	`rebuild()` Recreates the corpus index upon the text loaded so far.
`boolean`	`save(String fname)` Saves this object to a binary file.
`void`	`setEncoding(String encode)` Defines a new encoding for reading corpora text files.
`static Vector<String>`	`splitWords(String s)` Splits a given string sentence in a list of words.
`int`	`sum()` Sums the frequencies for all tokens.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

sdict

public TreeMap<String,Integer> sdict

A corpora index with the words/tokens being the keys. Given a word we can obtain its numeric index.

idict

public TreeMap<Integer,String> idict

A corpora index with the numeric index being the key. Given a numeric index we can get the corresponding word.

hstab

public Hashtable<String,Integer> hstab

A hash table for counting word frequencies in corpora.

TRUNCV

public int TRUNCV

Size of word truncation. If this value is greater than zero, tokens read from the corpora are truncated and stored with a maximum length of TRUNCV characters.
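
The truncation rule can be illustrated as follows. This is only a sketch of the behaviour described above: the class TruncationSketch and the method truncate are hypothetical, and the value used here for NO_TRUNC is an assumption, since this page does not state it.

```java
// Hypothetical helper for illustration; not part of hultig.sumo.
class TruncationSketch {
    // Assumed sentinel value; the documented NO_TRUNC value is not given on this page.
    static final int NO_TRUNC = 0;

    // Keeps at most truncv leading characters when truncv > 0;
    // otherwise returns the token unchanged.
    static String truncate(String token, int truncv) {
        if (truncv > 0 && token.length() > truncv) {
            return token.substring(0, truncv);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(truncate("economical", 5));
        System.out.println(truncate("economy", 5));
        System.out.println(truncate("economy", NO_TRUNC));
    }
}
```

With a truncation size of 5, both "economical" and "economy" are stored under the same key, which behaves like a crude form of stemming.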

NO_TRUNC

public static int NO_TRUNC

The code for disabling word truncation; see TRUNCV.

Constructor Detail

CorpusIndex

public CorpusIndex()

The default constructor initializes the class main properties and components, by also calling the clearHash() method. The default encoding is UTF-8.

CorpusIndex

public CorpusIndex(int truncv)

Provides the main initializations on this class, by also calling the clearHash() method, and sets the word truncation value.

Parameters:: truncv - The word truncation value (see TRUNCV).

Method Detail

clearHash

public final void clearHash()

Recreates the current index main table hstab.

splitWords

public static Vector<String> splitWords(String s)

Splits a given string sentence in a list of words.

Parameters:: s - The string sentence.
Returns:: The list of words/tokens found in s.
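
This page does not state the tokenization rules used by splitWords. The sketch below shows one plausible approach and is not the library's implementation: the class SplitWordsSketch is hypothetical, and the rule (a token is either a run of letters and digits or a single non-space symbol) is an assumption.

```java
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for illustration; the real splitWords may differ.
class SplitWordsSketch {
    // Assumed rule: a run of letters/digits, or one non-space, non-alphanumeric symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    static Vector<String> splitWords(String s) {
        Vector<String> words = new Vector<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitWords("The economy grew, slowly."));
    }
}
```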

readCorpus

public boolean readCorpus(Vector<String> vtokens)

Recreates the index from a list of string tokens, presumably words.

Parameters:: vtokens - The list of string tokens.
Returns:: The true value on success, and false if some erroneous situation occurs.

readCorpus

public boolean readCorpus(String filename)

Reads a corpus text file, recreating the index. This method calls readCorpus(filename, false).

Parameters:: filename - The file name from which text will be read.
Returns:: The true value on success, and false if some erroneous situation occurs.

readCorpus

public boolean readCorpus(String filename,
                          boolean adding)

Reads a corpus text file, incrementally adding its new "unseen" words to this object. The index is recreated only if adding = false.

Parameters:: filename - The file name from which the corpus is read.; adding - A flag that determines whether previously read corpora data should be kept (true) or cleared (false).
Returns:: The true value on success, and false if some erroneous situation occurs.
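
The effect of the adding flag, together with the reading encoding, can be sketched as below. The class ReadCorpusSketch is hypothetical and only counts whitespace-separated tokens; how CorpusIndex actually tokenizes, and when it renumbers the index, is not shown on this page and is assumed here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Hashtable;

// Hypothetical stand-in for illustration; not the CorpusIndex implementation.
class ReadCorpusSketch {
    final Hashtable<String, Integer> hstab = new Hashtable<>(); // token -> frequency
    String encoding = "UTF-8"; // documented default encoding

    boolean readCorpus(String filename, boolean adding) {
        if (!adding) {
            hstab.clear(); // start over unless the caller asked to accumulate
        }
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(filename), Charset.forName(encoding))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String tok : line.trim().split("\\s+")) {
                    if (!tok.isEmpty()) hstab.merge(tok, 1, Integer::sum);
                }
            }
            return true;
        } catch (IOException | RuntimeException e) {
            // Missing file, undecodable bytes or unknown charset name.
            return false;
        }
    }
}
```

Calling readCorpus(name, true) on several files in sequence accumulates their counts, while readCorpus(name, false) discards whatever was read before.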

addText

public void addText(Text txt)

Incrementally adds the words of a given Text to this corpus index. This method should be used together with the methods clearHash() and rebuild(), as in the following example:

    // txt1, txt2 and txt3 are Text objects created elsewhere.
    CorpusIndex dic = new CorpusIndex();
    dic.clearHash();     // reset the main table hstab
    dic.addText(txt1);   // add the words of each text
    dic.addText(txt2);
    dic.addText(txt3);
    dic.rebuild();       // recompute the numeric indexes

Parameters:: txt - The text to be added to this index.

add

public void add(Sentence stc)

Incrementally adds the words of a given Sentence to this corpora index. Operates similarly to addText(Text).

Parameters:: stc - The sentence to be added to this index.

add

public void add(String str)

Adds the words contained in a given string to this corpus index.

Parameters:: str - The input string.

add

public void add(String[] vs)

Adds the words contained in an array of strings to this corpus index. This method invokes the add(String) method.

Parameters:: vs - The array of strings to be processed and integrated.

add

public void add(Sentence[] vs)

Adds the words contained in an array of sentences to this corpus index. This method invokes the add(Sentence) method.

Parameters:: vs - The array of sentences from which to add the words.

rebuild

public void rebuild()

Recreates the corpus index upon the text loaded so far. The numeric indexes are recomputed.

printDict

public boolean printDict(String fout)

Prints the corpus index in a text file (see printDict(PrintStream)).

Parameters:: fout - The file name into which the corpus index is going to be printed.
Returns:: The true value on success, and false if some erroneous situation occurs.

printDict

public boolean printDict(PrintStream out)

Prints the corpus index in a text file. Each word is printed with its numeric index and its frequency, one word per line, in the format "KEY WORD FREQ", for example: 10045 economy 2795.

Parameters:: out - The file stream into which the corpus index is going to be printed.
Returns:: The true value on success, and false if some erroneous situation occurs.
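
The "KEY WORD FREQ" line format can be produced as in the sketch below. The class PrintDictSketch and its map parameters are hypothetical; the single-space separator and the ordering by key are assumptions.

```java
import java.io.PrintStream;
import java.util.Map;

// Hypothetical writer for illustration; not the CorpusIndex implementation.
class PrintDictSketch {
    // Writes one "KEY WORD FREQ" line per entry of idict (code -> token).
    static boolean printDict(PrintStream out,
                             Map<Integer, String> idict,
                             Map<String, Integer> freq) {
        if (out == null) return false;
        for (Map.Entry<Integer, String> e : idict.entrySet()) {
            Integer f = freq.get(e.getValue());
            out.println(e.getKey() + " " + e.getValue() + " " + (f == null ? 0 : f));
        }
        // checkError() flushes the stream and reports any earlier write failure.
        return !out.checkError();
    }
}
```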

loadASCIIDictionary

public boolean loadASCIIDictionary(String filename)

Loads a corpus index table from a text file. The expected format is one 3-tuple per line, KEY TOKEN FREQ, following the scheme and example shown in method printDict(PrintStream), which is the inverse of this method.

Parameters:: filename - The file name from which to load the table.
Returns:: The true value on success, and false if some erroneous situation occurs.
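
Parsing one line of that format could look like the sketch below. The class AsciiDictionarySketch and its method addLine are hypothetical, and whitespace-separated fields are an assumption; how the real method treats malformed lines is not documented here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical parser for illustration; not the CorpusIndex implementation.
class AsciiDictionarySketch {
    final TreeMap<String, Integer> sdict = new TreeMap<>(); // token -> code
    final TreeMap<Integer, String> idict = new TreeMap<>(); // code -> token
    final Map<String, Integer> freq = new HashMap<>();      // token -> frequency

    // Parses one "KEY TOKEN FREQ" line; returns false if the line is malformed.
    boolean addLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) return false;
        try {
            int key = Integer.parseInt(parts[0]);
            int f = Integer.parseInt(parts[2]);
            sdict.put(parts[1], key);
            idict.put(key, parts[1]);
            freq.put(parts[1], f);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```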

load

public void load(CorpusIndex d)

Redefines this corpus index, based on an already existing one.

Parameters:: d - The new index that redefines this object.

load

public boolean load(String fname)

Loads a given corpus index from a binary file, previously saved by an instance of this class through the method save(String).

Parameters:: fname - The file name from which to read.
Returns:: The true value on success, and false if some erroneous situation occurs.

save

public boolean save(String fname)

Saves this object to a binary file.

Parameters:: fname - The file name into which this object is saved.
Returns:: The true value on success, and false if some erroneous situation occurs.
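
Because the class implements Serializable, the binary file is presumably written with standard Java object serialization; this page does not confirm that. Under that assumption, a save/load round trip can be sketched as follows. The class SerializableIndexSketch is hypothetical, and its static load returns a new object, whereas the documented load(String) updates the receiver and returns a boolean.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.TreeMap;

// Hypothetical stand-in for illustration; not the CorpusIndex implementation.
class SerializableIndexSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    final TreeMap<String, Integer> sdict = new TreeMap<>(); // token -> code

    // Writes this object to a binary file; false on any I/O error.
    boolean save(String fname) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(fname))) {
            out.writeObject(this);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Reads an object previously written by save; null on failure.
    static SerializableIndexSketch load(String fname) {
        try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(fname))) {
            return (SerializableIndexSketch) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            return null;
        }
    }
}
```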

codeFile

public void codeFile(String infile,
                     String outfile)

Codifies a file according to the loaded dictionary.

Parameters:: infile - The file to be codified.; outfile - The generated codified file.

get

public String get(int key)

Gets the token for a given code key. If the key does not belong to this dictionary, the null value is returned.

get

public String get(int[] vkeys)

Given an array of codes expected to represent a word sequence, for example a sentence, returns the corresponding string form.

Parameters:: vkeys - The array of word keys.
Returns:: The string form of the word sequence.
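
Decoding an array of codes back into text amounts to one lookup per code, as in the sketch below. The class DecodeSketch is hypothetical; joining tokens with a single space and printing "?" for an unknown code are assumptions, since this page does not say how get(int[]) handles either case.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical decoder for illustration; not the CorpusIndex implementation.
class DecodeSketch {
    // Maps each code to its token and joins the tokens with single spaces.
    static String decode(Map<Integer, String> idict, int[] vkeys) {
        StringBuilder sb = new StringBuilder();
        for (int key : vkeys) {
            String tok = idict.get(key);
            if (sb.length() > 0) sb.append(' ');
            sb.append(tok == null ? "?" : tok); // "?" placeholder is an assumption
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> idict = new TreeMap<>();
        idict.put(0, "cat");
        idict.put(1, "sat");
        idict.put(2, "the");
        System.out.println(decode(idict, new int[] {2, 0, 1}));
    }
}
```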

get

public int get(String token)

Gets the code for a given token.

Parameters:: token - The token string.
Returns:: The code or -1 if something is wrong.

freq

public int freq(String token)

Gives the token frequency.

Parameters:: token - The token string.
Returns:: The token frequency, or -1 when the given token is not found in this dictionary.

sum

public int sum()

Sums the frequencies for all tokens.

Returns:: The sum, or -1 if the dictionary hash table is not defined.
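
The lookup methods get(String), freq and sum all use -1 as their "not available" result. The sketch below shows that convention with plain maps. The class LookupSketch is hypothetical, and treating a null token as "not found" is an assumption.

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical helpers for illustration; not the CorpusIndex implementation.
class LookupSketch {
    // Code of a token, or -1 when the token is unknown.
    static int get(Map<String, Integer> sdict, String token) {
        Integer code = (token == null) ? null : sdict.get(token);
        return code == null ? -1 : code;
    }

    // Frequency of a token, or -1 when the token is unknown.
    static int freq(Hashtable<String, Integer> hstab, String token) {
        if (hstab == null || token == null) return -1;
        Integer f = hstab.get(token);
        return f == null ? -1 : f;
    }

    // Sum of all frequencies, or -1 when the table is not defined.
    static int sum(Hashtable<String, Integer> hstab) {
        if (hstab == null) return -1;
        int total = 0;
        for (int f : hstab.values()) total += f;
        return total;
    }
}
```

Callers need to test for -1 before using the result, for example before dividing freq(token) by sum() to estimate a relative frequency.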

codify

public void codify(Sentence[] vs)

Codifies any "Word" contained in an array of Sentences, according to this dictionary. By "codifying" we mean here that each word gets its dictionary index.

Parameters:: vs - The array of sentences to be codified.

codifyOnFly

public static CorpusIndex codifyOnFly(Sentence... sentences)

Codification "on the fly" for a given array of sentences. It means that the dictionary is automatically created for the received array of sentences and their words are codified accordingly.

Parameters:: sentences - The array of sentences to be codified.
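
"On the fly" codification can be pictured as two passes over the input: one to build the dictionary, one to replace words by codes. The sketch below uses plain strings instead of Sentence objects; the class CodifyOnFlySketch is hypothetical, and the whitespace tokenization and alphabetical numbering from 0 are assumptions.

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical two-pass encoder for illustration; not the CorpusIndex implementation.
class CodifyOnFlySketch {
    // Fills dict (token -> code) from the sentences, then returns one code array per sentence.
    static int[][] codifyOnFly(TreeMap<String, Integer> dict, String... sentences) {
        // Pass 1: collect the vocabulary in sorted order.
        TreeSet<String> vocab = new TreeSet<>();
        for (String s : sentences) {
            for (String tok : s.trim().split("\\s+")) {
                if (!tok.isEmpty()) vocab.add(tok);
            }
        }
        dict.clear();
        for (String tok : vocab) {
            dict.put(tok, dict.size()); // size before insertion is the next free code
        }
        // Pass 2: replace every token by its code.
        int[][] codes = new int[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            String trimmed = sentences[i].trim();
            String[] toks = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            codes[i] = new int[toks.length];
            for (int j = 0; j < toks.length; j++) {
                codes[i][j] = dict.get(toks[j]);
            }
        }
        return codes;
    }
}
```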

codifyOnFly

public static void codifyOnFly(ChunkedSentence[] sentences)

Codification "on the fly" for an array of chunked sentences. This method is similar to the codifyOnFly(Sentence...) method.

Parameters:: sentences - The array of chunked sentences to be codified.

setEncoding

public void setEncoding(String encode)

Defines a new encoding for reading corpora text files.

Parameters:: encode - The encoding string, for example: UTF-8, or ISO-8859-1.

getEncoding

public String getEncoding()

Gives the current encoding string, used to read corpora files.

Returns:: The encoding string.

printHelp

public static void printHelp()

Prints the set of arguments that can be passed through the command line (main).

demoForWeb

public static void demoForWeb()

main

public static void main(String[] args)

This "main" method enables the command line execution of this class in order to create a given corpus dictionary.

Parameters:: args - Should comply with the syntax defined in the printHelp() method.
