hultig.sumo
Class Text

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractList<E>
          extended by java.util.AbstractSequentialList<E>
              extended by java.util.LinkedList<Sentence>
                  extended by hultig.sumo.Text
All Implemented Interfaces:
Serializable, Cloneable, Iterable<Sentence>, Collection<Sentence>, Deque<Sentence>, List<Sentence>, Queue<Sentence>

public final class Text
extends LinkedList<Sentence>

A class to represent and manage text. It can represent a textual document or a list of independent sentences, since internally it is stored as a linked list of sentences.

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:
Serialized Form

Field Summary
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
Text()
          The default constructor.
Text(String s)
          Creates a new text from a given string.
Text(String[] vs)
          Creates a text from an array of strings.
Text(String s, OpenNLPKit onlpk)
          Creates a new text from a given string.
 
Method Summary
 boolean add(Sentence s)
          Adds a sentence to this text, by inserting it at the end of the list (appending a sentence).
 boolean add(String s)
          Adds all the sentences contained in the given string to this text.
 boolean add(String stxt, OpenNLPKit onlpk)
          Adds all the sentences contained in the given string to this text.
 void add(Text t)
          Adds all the sentences contained in another Text object, to this text.
 void codify()
          Codifies this text according to the corpus index referenced by CINDEX.
 void codify(CorpusIndex idx)
          Codifies every word from this text according to a given corpus index (CorpusIndex).
 void cutIfLessThan(int numwords)
          Eliminates all sentences having fewer words than a given minimum number.
 int freq(String sw)
           
 CorpusIndex getCorpusIndex()
          Gives the reference of the corpus index stored in this object, and possibly used to codify the text.
 int getNumTokens()
           
 Sentence getSentence(int index)
          Gives the i-th sentence from this text.
 Sentence[] getSentences()
          Gives an array with all the sentences from this text.
 String[] getVocab()
           
 String getWord(int i, int j)
          Tries to return the string of the j-th word from the i-th sentence of this text.
static void main(String[] argv)
          The main method contains a general class tester.
 void print()
          Outputs the text sentences, one sentence per line.
 void print(String sleft, String sright, boolean withIndex)
          Outputs the text sentences, one sentence per line.
 void printVocabulary()
           
 double prob(String sw)
           
 void randomDrop(int n)
          Randomly eliminates n sentences from this text.
 boolean readFile(String filename)
          Adds all the sentences contained in a given text file to this text object.
 void removeDuplicates()
          Removes duplicate sentences from this text.
 boolean saveFile()
          Saves this text to a new file whose name is the current time stamp in the format YYYYMTDDHHMMSS.txt, where YYYY, MT, DD, HH, MM, and SS represent the year, month, day, hour, minute, and second, respectively.
 boolean saveFile(String filename)
          Saves the current text to a given file.
 boolean shuffle(Random r)
          Randomly shuffles the sentences in this text.
 double similarity(Text othr)
          Computes a lexical similarity between two texts, based on local evidence.
static boolean testSimilarity()
           
 void toLowerCase()
          Converts every word in this text to lower case.
 String toString()
          Gives a concatenation of the sentences from this text.
 String toString(String separator)
          Gives a concatenation of the sentences from this text.
 
Methods inherited from class java.util.LinkedList
add, addAll, addAll, addFirst, addLast, clear, clone, contains, descendingIterator, element, get, getFirst, getLast, indexOf, lastIndexOf, listIterator, offer, offerFirst, offerLast, peek, peekFirst, peekLast, poll, pollFirst, pollLast, pop, push, remove, remove, remove, removeFirst, removeFirstOccurrence, removeLast, removeLastOccurrence, set, size, toArray, toArray
 
Methods inherited from class java.util.AbstractSequentialList
iterator
 
Methods inherited from class java.util.AbstractList
equals, hashCode, listIterator, removeRange, subList
 
Methods inherited from class java.util.AbstractCollection
containsAll, isEmpty, removeAll, retainAll
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
containsAll, equals, hashCode, isEmpty, iterator, listIterator, removeAll, retainAll, subList
 
Methods inherited from interface java.util.Deque
iterator
 

Constructor Detail

Text

public Text()
The default constructor.


Text

public Text(String s)
Creates a new text from a given string. The string may contain more than one sentence. Each sentence will be detected and inserted.

Parameters:
s - The text string.

Text

public Text(String[] vs)
Creates a text from an array of strings. Each string may contain more than one sentence, similarly to Text(String s).

Parameters:
vs - The array of strings.

Text

public Text(String s,
            OpenNLPKit onlpk)
Creates a new text from a given string. The string may contain more than one sentence. Each sentence will be detected and inserted. This constructor uses the OpenNLP sentence detector, which is more accurate and relies on a model trained for the English language. See the OpenNLP project.

Parameters:
s - The text string.
onlpk - The OpenNLP Kit.
Method Detail

add

public boolean add(String s)
Adds all the sentences contained in the given string to this text. Sentence boundary detection is performed by Scott Piao's package. Defined on 2007/03/09.

Parameters:
s - The text string.
Returns:
True if sentences were added to this text.

add

public boolean add(String stxt,
                   OpenNLPKit onlpk)
Adds all the sentences contained in the given string to this text. Sentence boundary detection is performed by an OpenNLP model trained for the English language, which performs better. See the OpenNLP project.

Parameters:
stxt - The text string.
onlpk - The OpenNLP Kit.
Returns:
The true value on success.

add

public boolean add(Sentence s)
Adds a sentence to this text, by inserting it at the end of the list (appending a sentence).

Specified by:
add in interface Collection<Sentence>
Specified by:
add in interface Deque<Sentence>
Specified by:
add in interface List<Sentence>
Specified by:
add in interface Queue<Sentence>
Overrides:
add in class LinkedList<Sentence>
Parameters:
s - The sentence to append.
Returns:

add

public void add(Text t)
Adds all the sentences contained in another Text object, to this text.

Parameters:
t - The other text object.

cutIfLessThan

public void cutIfLessThan(int numwords)
Eliminates all sentences having fewer words than a given minimum number.

Parameters:
numwords - The minimum number of words.
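
The filtering rule can be illustrated with standard collections. This is a minimal sketch, not the library's implementation: it uses plain strings in place of Sentence objects, counts words by splitting on whitespace (an assumption), and the names CutShortSentencesSketch and wordCount are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

final class CutShortSentencesSketch {
    // Counts whitespace-separated words (assumed tokenization, not the library's).
    static int wordCount(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Removes every sentence with fewer than numwords words, in place.
    static void cutIfLessThan(List<String> sentences, int numwords) {
        sentences.removeIf(s -> wordCount(s) < numwords);
    }

    public static void main(String[] args) {
        List<String> sentences = new LinkedList<>(Arrays.asList(
                "Hi.", "This is a longer sentence.", "Two words"));
        cutIfLessThan(sentences, 3);
        System.out.println(sentences);
    }
}
```

A sentence with exactly numwords words is kept, since only sentences with fewer words are removed.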

readFile

public boolean readFile(String filename)
Adds all the sentences contained in a given text file to this text object. The new sentences are appended after any sentences already present.

Parameters:
filename - The file from which to read the sentences.

saveFile

public boolean saveFile()
Saves this text to a new file whose name is the current time stamp in the format YYYYMTDDHHMMSS.txt, where YYYY, MT, DD, HH, MM, and SS represent the year, month, day, hour, minute, and second, respectively. For example: "20100320171545.txt".

Returns:
The true value on success.
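
A file name in this time stamp format can be produced with the standard java.time API. The following is a minimal sketch under that assumption, not the code used by saveFile(); the names TimestampNameSketch and timestampFileName are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class TimestampNameSketch {
    // yyyy=year, MM=month, dd=day, HH=hour (0-23), mm=minute, ss=second.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Builds the ".txt" file name for a given moment.
    static String timestampFileName(LocalDateTime when) {
        return when.format(STAMP) + ".txt";
    }

    public static void main(String[] args) {
        // Prints 20100320171545.txt, matching the documented example.
        System.out.println(timestampFileName(LocalDateTime.of(2010, 3, 20, 17, 15, 45)));
        // Name for the current moment.
        System.out.println(timestampFileName(LocalDateTime.now()));
    }
}
```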

saveFile

public boolean saveFile(String filename)
Saves the current text to a given file.

Parameters:
filename - The name of the saved file.
Returns:
The true value on success.

toLowerCase

public void toLowerCase()
Converts every word in this text to lower case.


codify

public void codify(CorpusIndex idx)
Codifies every word from this text according to a given corpus index (CorpusIndex). The corpus index reference will be stored in CINDEX.

Parameters:
idx - The corpus index.

similarity

public double similarity(Text othr)
Computes a lexical similarity between two texts, based on local evidence. This function uses an adaptation of the TF*IDF vector representation, without computing the heavy IDF component. Based on Zipf's law, the word length and the relative frequency are used in the vector calculations.

Parameters:
othr - The other text.
Returns:
The similarity value in the [0, 1] interval.
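
The exact weighting is not given here, so the following is only a sketch of one plausible reading: each word is weighted by its relative frequency times its length, and the two weight vectors are compared with the cosine measure. The weighting formula, the whitespace tokenization, and the names LexicalSimilaritySketch, weights, and similarity are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class LexicalSimilaritySketch {
    // Maps each word to (relative frequency) * (word length); assumed weighting.
    static Map<String, Double> weights(String text) {
        Map<String, Double> w = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                w.merge(token, 1.0, Double::sum); // raw counts first
            }
        }
        final double total = w.values().stream().mapToDouble(Double::doubleValue).sum();
        w.replaceAll((word, count) -> (count / total) * word.length());
        return w;
    }

    // Cosine of the two weight vectors; all weights are non-negative, so the result lies in [0, 1].
    static double similarity(String a, String b) {
        Map<String, Double> wa = weights(a);
        Map<String, Double> wb = weights(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : wa.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = wb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : wb.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty text shares nothing with any other text
        }
        return Math.min(1.0, dot / Math.sqrt(normA * normB)); // clamp rounding error
    }

    public static void main(String[] args) {
        System.out.println(similarity("the cat sat on the mat", "the dog sat"));
    }
}
```

No document-frequency table is needed, which is the point of dropping the IDF component: both vectors are built from the two texts alone.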

codify

public void codify()
Codifies this text according to the corpus index referenced by CINDEX.


getCorpusIndex

public CorpusIndex getCorpusIndex()
Gives the reference of the corpus index stored in this object, and possibly used to codify the text.

Returns:
The corpus index reference.

getWord

public String getWord(int i,
                      int j)
Tries to return the string of the j-th word from the i-th sentence of this text.

Parameters:
i - The sentence index position.
j - The word index position in a given sentence.
Returns:
The string of the word, or null if no word exists at the given i and j indexes.
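
The null-on-miss contract can be sketched as a bounds-checked lookup. This is an illustration only: it uses a two-dimensional string array in place of the text, assumes zero-based indexes (the indexing base is not stated here), and the name GetWordSketch is hypothetical.

```java
final class GetWordSketch {
    // Returns the j-th word of the i-th sentence, or null when either index is out of range.
    static String getWord(String[][] sentences, int i, int j) {
        if (i < 0 || i >= sentences.length) {
            return null; // no such sentence
        }
        String[] words = sentences[i];
        if (j < 0 || j >= words.length) {
            return null; // no such word in that sentence
        }
        return words[j];
    }

    public static void main(String[] args) {
        String[][] text = { {"Hello", "world"}, {"Second", "sentence", "here"} };
        System.out.println(getWord(text, 1, 2));
        System.out.println(getWord(text, 5, 0));
    }
}
```

Callers should check the result for null before using it rather than relying on an exception.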

getSentence

public Sentence getSentence(int index)
Gives the i-th sentence from this text.

Parameters:
index - The sentence index in the text.
Returns:
The i-th sentence or null if not found for the given index.

getSentences

public Sentence[] getSentences()
Gives an array with all the sentences from this text.

Returns:
The array of sentences.

getNumTokens

public int getNumTokens()

getVocab

public String[] getVocab()

removeDuplicates

public void removeDuplicates()
Removes duplicate sentences from this text.
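
One common way to drop duplicates while keeping the original order is a LinkedHashSet. The sketch below assumes that sentences are compared by string equality and that the first occurrence is the one kept; neither detail is stated here, and the name RemoveDuplicatesSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;

final class RemoveDuplicatesSketch {
    // Keeps the first occurrence of each sentence and preserves the original order.
    static void removeDuplicates(List<String> sentences) {
        LinkedHashSet<String> unique = new LinkedHashSet<>(sentences);
        sentences.clear();
        sentences.addAll(unique);
    }

    public static void main(String[] args) {
        List<String> sentences = new LinkedList<>(Arrays.asList(
                "A first sentence.", "A second one.", "A first sentence."));
        removeDuplicates(sentences);
        System.out.println(sentences);
    }
}
```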


randomDrop

public void randomDrop(int n)
Randomly eliminates n sentences from this text.

Parameters:
n - The number of sentences to be eliminated.
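
Random elimination can be sketched by repeatedly removing the element at a random index. This is an assumption about the approach, not the library's code: the sketch takes an explicit Random so that runs are repeatable, caps n at the list size, and the name RandomDropSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

final class RandomDropSketch {
    // Removes up to n sentences chosen uniformly at random, in place.
    static void randomDrop(List<String> sentences, int n, Random random) {
        int toDrop = Math.min(n, sentences.size()); // never drop more than exist
        for (int k = 0; k < toDrop; k++) {
            sentences.remove(random.nextInt(sentences.size())); // remove by index
        }
    }

    public static void main(String[] args) {
        List<String> sentences = new LinkedList<>(Arrays.asList("s1", "s2", "s3", "s4", "s5"));
        randomDrop(sentences, 2, new Random());
        System.out.println(sentences.size() + " sentences left: " + sentences);
    }
}
```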

shuffle

public boolean shuffle(Random r)
Randomly shuffles the sentences in this text.

Parameters:
r - The random number generator used for the shuffle.
Returns:
boolean
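
Passing the Random explicitly makes a shuffle reproducible: the same seed yields the same order. The sketch below shows this with the standard Collections.shuffle on plain strings; it is not the library's implementation, and the name SeededShuffleSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SeededShuffleSketch {
    // Returns a shuffled copy; the input list is left untouched.
    static List<String> shuffled(List<String> sentences, long seed) {
        List<String> copy = new ArrayList<>(sentences);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        // Two shuffles with the same seed produce the same order.
        System.out.println(shuffled(sentences, 42L).equals(shuffled(sentences, 42L)));
    }
}
```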

print

public void print()
Outputs the text sentences, one sentence per line.


print

public void print(String sleft,
                  String sright,
                  boolean withIndex)
Outputs the text sentences, one sentence per line. Each sentence will be surrounded by a left and a right string and may be marked with its sequential index.

Parameters:
sleft - The left string context.
sright - The right string context.
withIndex - Print the sequential sentence index.

freq

public int freq(String sw)

prob

public double prob(String sw)

printVocabulary

public void printVocabulary()

toString

public String toString()
Gives a concatenation of the sentences from this text.

Overrides:
toString in class AbstractCollection<Sentence>
Returns:
A string representing this text.

toString

public String toString(String separator)
Gives a concatenation of the sentences from this text. Between each sentence a given separator is inserted.

Parameters:
separator - The separator inserted between consecutive sentences.
Returns:
A string representation of this text.
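
The effect of the separator can be illustrated with the standard String.join, which places the separator only between elements, never before the first or after the last. This sketch uses plain strings in place of Sentence objects; whether toString(String) handles the ends the same way is an assumption, and the name JoinSentencesSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class JoinSentencesSketch {
    // Concatenates the sentences with the separator between each consecutive pair.
    static String join(List<String> sentences, String separator) {
        return String.join(separator, sentences);
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("First one.", "Second one.", "Third one.");
        // Prints: First one. | Second one. | Third one.
        System.out.println(join(sentences, " | "));
    }
}
```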

main

public static void main(String[] argv)
The main method contains a general class tester.

Parameters:
argv - An optional single argument giving the path to a file to be processed.

testSimilarity

public static boolean testSimilarity()