Text

hultig.sumo
Class Text

java.lang.Object
  java.util.AbstractCollection<E>
      java.util.AbstractList<E>
          java.util.AbstractSequentialList<E>
              java.util.LinkedList<Sentence>
                  hultig.sumo.Text

All Implemented Interfaces:: Serializable, Cloneable, Iterable<Sentence>, Collection<Sentence>, Deque<Sentence>, List<Sentence>, Queue<Sentence>

public final class Text
extends LinkedList<Sentence>

A class to represent and manage text. It can represent a textual document or a list of independent sentences, since internally it is stored as a linked list of sentences.

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:: Serialized Form

Field Summary

Fields inherited from class java.util.AbstractList
`modCount`

Constructor Summary
`Text()` The default constructor.
`Text(String s)` Creates a new text from a given string.
`Text(String[] vs)` Creates a text from an array of strings.
`Text(String s, OpenNLPKit onlpk)` Creates a new text from a given string.

Method Summary
`boolean`	`add(Sentence s)` Adds a sentence to this text, by inserting it at the end of the list (appending a sentence).
`boolean`	`add(String s)` Adds all the sentences contained in the given string to this text.
`boolean`	`add(String stxt, OpenNLPKit onlpk)` Adds all the sentences contained in the given string to this text.
`void`	`add(Text t)` Adds all the sentences contained in another Text object, to this text.
`void`	`codify()` Codifies this text according to the corpus index referenced by `CINDEX`.
`void`	`codify(CorpusIndex idx)` Codifies every word from this text against a given corpus index (CorpusIndex).
`void`	`cutIfLessThan(int numwords)` Eliminates all sentences having fewer words than a given minimum number.
`int`	`freq(String sw)`
`CorpusIndex`	`getCorpusIndex()` Gives the reference of the corpus index stored in this object, and possibly used to codify the text.
`int`	`getNumTokens()`
`Sentence`	`getSentence(int index)` Gives the i-th sentence from this text.
`Sentence[]`	`getSentences()` Gives an array with all the sentences from this text.
`String[]`	`getVocab()`
`String`	`getWord(int i, int j)` Tries to return the string of the j-th word from the i-th sentence of this text.
`static void`	`main(String[] argv)` The main method contains a general class tester.
`void`	`print()` Outputs the text sentences, one sentence per line.
`void`	`print(String sleft, String sright, boolean withIndex)` Outputs the text sentences, one sentence per line.
`void`	`printVocabulary()`
`double`	`prob(String sw)`
`void`	`randomDrop(int n)` Randomly eliminates n sentences from this text.
`boolean`	`readFile(String filename)` Add all the sentences contained in a given text file to this text object.
`void`	`removeDuplicates()` Remove duplicate sentences from this text.
`boolean`	`saveFile()` Save this text to a new file with the name equal to the current time stamp in the format: YYYYMTDDHHMMSS.txt, with YYYY, MT, DD, HH, MM, SS representing respectively the year, month, day, hour, minute, and second.
`boolean`	`saveFile(String filename)` Saves the current text to a given file.
`boolean`	`shuffle(Random r)` Randomly shuffles the sentences in this text.
`double`	`similarity(Text othr)` Computes a lexical similarity between two texts, based on local evidence.
`static boolean`	`testSimilarity()`
`void`	`toLowerCase()` Turns every word in this text to lower case.
`String`	`toString()` Gives a concatenation of the sentences from this text.
`String`	`toString(String separator)` Gives a concatenation of the sentences from this text.

Methods inherited from class java.util.LinkedList
`add, addAll, addAll, addFirst, addLast, clear, clone, contains, descendingIterator, element, get, getFirst, getLast, indexOf, lastIndexOf, listIterator, offer, offerFirst, offerLast, peek, peekFirst, peekLast, poll, pollFirst, pollLast, pop, push, remove, remove, remove, removeFirst, removeFirstOccurrence, removeLast, removeLastOccurrence, set, size, toArray, toArray`

Methods inherited from class java.util.AbstractSequentialList
`iterator`

Methods inherited from class java.util.AbstractList
`equals, hashCode, listIterator, removeRange, subList`

Methods inherited from class java.util.AbstractCollection
`containsAll, isEmpty, removeAll, retainAll`

Methods inherited from class java.lang.Object
`finalize, getClass, notify, notifyAll, wait, wait, wait`

Methods inherited from interface java.util.List
`containsAll, equals, hashCode, isEmpty, iterator, listIterator, removeAll, retainAll, subList`

Methods inherited from interface java.util.Deque
`iterator`

Constructor Detail

Text

public Text()

The default constructor.

Text

public Text(String s)

Creates a new text from a given string. The string may contain more than one sentence. Each sentence will be detected and inserted.

Parameters:: s - The text string.
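
The sentence detection step can be illustrated with the JDK alone. The following is a minimal sketch, not the detector used by this class: it assumes `java.text.BreakIterator` as a stand-in, and the class name `SentenceSplitSketch` and method `split` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; this is not part of hultig.sumo.
final class SentenceSplitSketch {

    /** Splits a string into trimmed, non-empty sentences with the JDK BreakIterator. */
    static List<String> split(String s) {
        List<String> sentences = new LinkedList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = s.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // One detected sentence is printed per line.
        for (String sentence : split("The cat sat on the mat. Did it stay there? Yes!")) {
            System.out.println(sentence);
        }
    }
}
```

Each detected sentence would then be appended to the list, in the order in which it appears in the string.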

Text

public Text(String[] vs)

Creates a text from an array of strings. Each string may contain more than one sentence, similarly to Text(String s).

Parameters:: vs - The array of strings.

Text

public Text(String s,
            OpenNLPKit onlpk)

Creates a new text from a given string. The string may contain more than one sentence. Each sentence will be detected and inserted. This constructor uses the OpenNLP sentence detector, which is more accurate, being based on a model trained for the English language. See the OpenNLP project.

Parameters:: s - The text string.; onlpk - The OpenNLP Kit.

Method Detail

add

public boolean add(String s)

Adds all the sentences contained in the given string to this text. Sentence boundary detection is performed by Scott Piao's package. Defined on 2007/03/09.

Parameters:: s - The text string.
Returns:: True if sentences were added to this text.

add

public boolean add(String stxt,
                   OpenNLPKit onlpk)

Adds all the sentences contained in the given string to this text. Sentence boundary detection is performed by an OpenNLP model trained for the English language, which performs better than the detector used by add(String s). See the OpenNLP project.

Parameters:: stxt - The text string.; onlpk - The OpenNLP Kit.
Returns:: The true value on success.

add

public boolean add(Sentence s)

Adds a sentence to this text, by inserting it at the end of the list (appending a sentence).

Specified by:: add in interface Collection<Sentence>
Specified by:: add in interface Deque<Sentence>
Specified by:: add in interface List<Sentence>
Specified by:: add in interface Queue<Sentence>
Overrides:: add in class LinkedList<Sentence>

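Parameters:: s - The sentence to append.
Returns:: True if this text changed as a result of the call, as specified by Collection.add.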

add

public void add(Text t)

Adds all the sentences contained in another Text object, to this text.

Parameters:: t - The other text object.

cutIfLessThan

public void cutIfLessThan(int numwords)

Eliminates all sentences having fewer words than a given minimum number.

Parameters:: numwords - The minimum number of words.
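
The filtering behaviour can be sketched on a plain list of strings. This is an illustration under assumptions: sentences are modelled as `String` values, words are counted by splitting on whitespace, and the class `CutShortSentencesSketch` is hypothetical. The tokenisation used by Sentence may differ.

```java
import java.util.Arrays;
import java.util.LinkedList;

// Hypothetical illustration; sentences are modelled as plain strings.
final class CutShortSentencesSketch {

    /** Counts whitespace-separated words (an assumed notion of "word"). */
    static int countWords(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Removes, in place, every sentence with fewer than numwords words. */
    static void cutIfLessThan(LinkedList<String> sentences, int numwords) {
        sentences.removeIf(s -> countWords(s) < numwords);
    }

    public static void main(String[] args) {
        LinkedList<String> text = new LinkedList<>(
                Arrays.asList("Yes.", "The cat sat on the mat.", "Maybe not."));
        cutIfLessThan(text, 3);
        System.out.println(text); // prints [The cat sat on the mat.]
    }
}
```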

readFile

public boolean readFile(String filename)

Add all the sentences contained in a given text file to this text object. The new sentences will be sequentially added after the possibly already existing ones.

Parameters:: filename - The file from which to read the sentences.

saveFile

public boolean saveFile()

Save this text to a new file with the name equal to the current time stamp in the format: YYYYMTDDHHMMSS.txt, with YYYY, MT, DD, HH, MM, SS representing respectively the year, month, day, hour, minute, and second. For example: "20100320171545.txt"

Returns:: The true value on success.
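
A file name in the format described above can be produced with `java.time`. The sketch below only illustrates the naming scheme; the class `TimestampFileNameSketch` and the method `fileNameFor` are hypothetical, and the way this class builds the name internally is not documented here.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of the YYYYMTDDHHMMSS.txt naming scheme.
final class TimestampFileNameSketch {

    // Year, month, day, hour (24h), minute, second, all zero-padded.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Builds the file name for a given moment. */
    static String fileNameFor(LocalDateTime moment) {
        return moment.format(STAMP) + ".txt";
    }

    public static void main(String[] args) {
        // prints 20100320171545.txt
        System.out.println(fileNameFor(LocalDateTime.of(2010, 3, 20, 17, 15, 45)));
        // File name for the current time stamp.
        System.out.println(fileNameFor(LocalDateTime.now()));
    }
}
```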

saveFile

public boolean saveFile(String filename)

Saves the current text to a given file.

Parameters:: filename - The name of the saved file.
Returns:: The true value on success.

toLowerCase

public void toLowerCase()

Turns every word in this text to lower case.

codify

public void codify(CorpusIndex idx)

Codifies every word from this text against a given corpus index (CorpusIndex). The corpus index reference will be stored in CINDEX.

Parameters:: idx - The corpus index.

similarity

public double similarity(Text othr)

Computes a lexical similarity between two texts, based on local evidence. This function uses an adaptation of the TF*IDF vector representation, without computing the heavy IDF component. Based on the Zipf law, the word length and the relative frequency are considered for the vectorial calculations.

Parameters:: othr - The other text.
Returns:: The similarity value in the [0, 1] interval.
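
The exact weighting formula is not given in this documentation. As a rough sketch of the general idea only, the code below assumes a weight of relative frequency multiplied by word length for each word, and compares the two weight vectors with the cosine measure. The class `LexicalSimilaritySketch`, the tokenisation, and the weighting are all assumptions, not the method implemented by this class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; the real weighting scheme may differ.
final class LexicalSimilaritySketch {

    /** Maps each word to (relative frequency * word length), an assumed weight. */
    static Map<String, Double> weights(String text) {
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1.0, Double::sum);
            total++;
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            result.put(e.getKey(), (e.getValue() / total) * e.getKey().length());
        }
        return result;
    }

    /** Cosine of the two weight vectors; lies in [0, 1] since weights are non-negative. */
    static double similarity(String a, String b) {
        Map<String, Double> wa = weights(a);
        Map<String, Double> wb = weights(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : wa.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = wb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : wb.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // at least one text has no words
        }
        return Math.min(1.0, dot / Math.sqrt(normA * normB));
    }

    public static void main(String[] args) {
        System.out.println(similarity("The cat sat on the mat.", "The cat lay on the rug."));
    }
}
```

No IDF term appears in the sketch, which matches the description above: only evidence local to the two texts is used.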

codify

public void codify()

Codifies this text according to the corpus index referenced by CINDEX.

getCorpusIndex

public CorpusIndex getCorpusIndex()

Gives the reference of the corpus index stored in this object, and possibly used to codify the text.

Returns:: The corpus index reference.

getWord

public String getWord(int i,
                      int j)

Tries to return the string of the j-th word from the i-th sentence of this text.

Parameters:: i - The sentence index position.; j - The word index position in a given sentence.
Returns:: The string of the word, or null if it does not exist for the indicated i and j indexes.
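
The null-on-miss contract can be sketched as a pair of bounds checks. The example models a text as a list of word lists and assumes zero-based indexes; both are assumptions, and the class `GetWordSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; a text is modelled as a list of word lists.
final class GetWordSketch {

    /** Returns word j of sentence i (zero-based, assumed), or null if either index is out of range. */
    static String getWord(List<List<String>> text, int i, int j) {
        if (i < 0 || i >= text.size()) {
            return null;
        }
        List<String> sentence = text.get(i);
        if (j < 0 || j >= sentence.size()) {
            return null;
        }
        return sentence.get(j);
    }

    public static void main(String[] args) {
        List<List<String>> text = Arrays.asList(
                Arrays.asList("The", "cat", "sat"),
                Arrays.asList("It", "purred"));
        System.out.println(getWord(text, 1, 1)); // prints purred
        System.out.println(getWord(text, 5, 0)); // prints null
    }
}
```

Callers should test the result for null before using it, since an invalid index is reported through the return value instead of an exception.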

getSentence

public Sentence getSentence(int index)

Gives the i-th sentence from this text.

Parameters:: index - The sentence index in the text.
Returns:: The i-th sentence or null if not found for the given index.

getSentences

public Sentence[] getSentences()

Gives an array with all the sentences from this text.

Returns:: The array of sentences.

getNumTokens

public int getNumTokens()

getVocab

public String[] getVocab()

removeDuplicates

public void removeDuplicates()

Remove duplicate sentences from this text.
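
One common way to drop duplicates while keeping the original order is a `LinkedHashSet`. The sketch below assumes that sentences are compared by exact equality and that the first occurrence is the one kept; neither is stated in this documentation, and the class `RemoveDuplicatesSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.LinkedList;

// Hypothetical illustration; sentences are modelled as plain strings.
final class RemoveDuplicatesSketch {

    /** Removes repeated sentences in place, keeping the first occurrence of each. */
    static void removeDuplicates(LinkedList<String> sentences) {
        LinkedHashSet<String> unique = new LinkedHashSet<>(sentences);
        sentences.clear();
        sentences.addAll(unique);
    }

    public static void main(String[] args) {
        LinkedList<String> text = new LinkedList<>(
                Arrays.asList("One.", "Two.", "One.", "Three.", "Two."));
        removeDuplicates(text);
        System.out.println(text); // prints [One., Two., Three.]
    }
}
```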

randomDrop

public void randomDrop(int n)

Randomly eliminates n sentences from this text.

Parameters:: n - The number of sentences to be eliminated.
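
A random drop can be sketched by removing one randomly chosen position n times. The explicit `Random` argument, the clamping of n to the list size, and the class `RandomDropSketch` are assumptions made for the example; the documented method takes only n.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; sentences are modelled as plain strings.
final class RandomDropSketch {

    /** Removes up to n randomly chosen sentences in place; n is clamped to [0, size]. */
    static void randomDrop(List<String> sentences, int n, Random r) {
        int toDrop = Math.min(Math.max(n, 0), sentences.size());
        for (int k = 0; k < toDrop; k++) {
            int victim = r.nextInt(sentences.size()); // index of the sentence to remove
            sentences.remove(victim);
        }
    }

    public static void main(String[] args) {
        List<String> text = new ArrayList<>(Arrays.asList("A.", "B.", "C.", "D.", "E."));
        randomDrop(text, 2, new Random());
        System.out.println(text.size()); // prints 3
    }
}
```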

shuffle

public boolean shuffle(Random r)

Randomly shuffles the sentences in this text, using the given random number generator.

Returns:: boolean
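
Passing the `Random` instance makes a shuffle reproducible when the generator is seeded. The sketch below uses `Collections.shuffle` from the JDK; the meaning given to the boolean result (false when there is nothing to shuffle) is an assumption, since this documentation does not define it, and the class `ShuffleSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; sentences are modelled as plain strings.
final class ShuffleSketch {

    /** Shuffles in place; returns false (assumed meaning) if fewer than two sentences exist. */
    static boolean shuffle(List<String> sentences, Random r) {
        if (sentences.size() < 2) {
            return false;
        }
        Collections.shuffle(sentences, r);
        return true;
    }

    public static void main(String[] args) {
        List<String> text = new ArrayList<>(Arrays.asList("A.", "B.", "C.", "D."));
        shuffle(text, new Random(42L)); // same seed, same order on every run
        System.out.println(text);
    }
}
```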

print

public void print()

Outputs the text sentences, one sentence per line.

print

public void print(String sleft,
                  String sright,
                  boolean withIndex)

Outputs the text sentences, one sentence per line. Each sentence will be surrounded by a left and a right string and may be marked with its sequential index.

Parameters:: sleft - The left string context.; sright - The right string context.; withIndex - Print the sequential sentence index.
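
The effect of the three parameters can be sketched as a line formatter. The index layout used here (a zero-based number followed by a space, placed before the left string) is an assumption, as the actual output layout is not documented; the class `PrintSketch` and method `formatLines` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the real output layout may differ.
final class PrintSketch {

    /** Builds one output line per sentence, wrapped in sleft and sright. */
    static List<String> formatLines(List<String> sentences, String sleft,
                                    String sright, boolean withIndex) {
        List<String> lines = new ArrayList<>();
        int index = 0;
        for (String sentence : sentences) {
            String prefix = withIndex ? index + " " : ""; // assumed index layout
            lines.add(prefix + sleft + sentence + sright);
            index++;
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("The cat sat.", "It purred.");
        for (String line : formatLines(text, "<s>", "</s>", true)) {
            System.out.println(line);
        }
    }
}
```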

freq

public int freq(String sw)

prob

public double prob(String sw)

printVocabulary

public void printVocabulary()

toString

public String toString()

Gives a concatenation of the sentences from this text.

Overrides:: toString in class AbstractCollection<Sentence>

Returns:: A string representing this text.

toString

public String toString(String separator)

Gives a concatenation of the sentences from this text. Between each sentence a given separator is inserted.

Parameters:: separator - The separator connecting two sentences.
Returns:: A string representation of this text.
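
The concatenation with a separator behaves like `String.join` on the sentence strings. The sketch below models sentences as plain strings and is only an illustration; the class `JoinSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; sentences are modelled as plain strings.
final class JoinSketch {

    /** Concatenates the sentences, inserting the separator between consecutive ones. */
    static String toString(List<String> sentences, String separator) {
        return String.join(separator, sentences);
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("The cat sat.", "It purred.");
        System.out.println(toString(text, " | ")); // prints The cat sat. | It purred.
    }
}
```

The separator appears only between sentences, never before the first or after the last one.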

main

public static void main(String[] argv)

The main method contains a general class tester.

Parameters:: argv - One parameter may be indicated, containing the path to a file to be processed.

testSimilarity

public static boolean testSimilarity()
