hultig.sumo
Class ChunkedSentence

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractList<E>
          extended by java.util.AbstractSequentialList<E>
              extended by java.util.LinkedList<Word>
                  extended by hultig.sumo.Sentence
                      extended by hultig.sumo.ChunkedSentence
All Implemented Interfaces:
Serializable, Cloneable, Comparable, Iterable<Word>, Collection<Word>, Deque<Word>, List<Word>, Queue<Word>

public class ChunkedSentence
extends Sentence

A specialization of the Sentence class, for handling shallow parsed sentences in a more efficient way. It uses chunk marks (ChunkMark) to represent the sequence of chunk boundaries.

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:
Serialized Form

Field Summary
 
Fields inherited from class hultig.sumo.Sentence
cod, label, parentise, pontuacao, stx
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
ChunkedSentence()
          The default constructor, which invokes several default settings, including the definition of chunk values.
ChunkedSentence(Sentence s, OpenNLPKit model)
          This constructor receives a sentence and a language model, and creates an instance of a chunked sentence.
ChunkedSentence(String s, OpenNLPKit model)
          This constructor receives a string and a language model, and creates a chunked sentence.
 
Method Summary
 Chunk getChunk(int index)
          Gets the chunk at position index, from this sentence chunk sequence.
 ChunkMark getChunkMark(int index)
          Gives the chunk mark (boundaries and tag), for the chunk at position index, in the sequence of sentence chunks.
 ChunkMark getChunkOnPosition(int index)
          Gives the chunk mark relative to the word at position index, in this sentence.
 Chunk[] getChunks()
          Gets an array containing the complete sequence of chunks, from this sentence, one chunk per array position.
 int getNumChunks()
          Gives the number of chunks contained in this sentence.
 int getNumChunks(String postag)
          Counts the number of chunks of a certain kind (tag).
 int getNumWords()
          Gives the number of effective words contained in this sentence.
 String getPOStrFixed()
           
 String getSPOSig()
          The same as getSPOSig(char chconnect), with a blank space as the default connection character.
 String getSPOSig(char chconnect)
          Gives a string with the sequence of part-of-speech tags, corresponding to the sequence of words in the sentence.
 String[] getVPOSig()
          Gives the array of part-of-speech tags, corresponding to the sequence of words in the sentence.
 String getWordChunkMark(int index)
          Gives the chunk mark for a word at position index, identifying first to which chunk the word belongs.
 double lexicoSyntacticEntailmentMetric(ChunkedSentence hypot)
          This function was designed to compute a likelihood value for the "lexico-syntactic entailment" between this sentence (thesis) and the other, entailed sentence (hypothesis).
static void main(String[] args)
          Generally exemplifies the operative features of this class.
 void printArrayWords()
          This method is a default shortcut for printArrayWords(java.lang.String), with label = null.
 void printArrayWords(String label)
          Outputs the sequence of words in this shallow parsed sentence with their corresponding lexico-syntactic codes.
 ChunkedSentence subList(int fromIndex, int toIndex)
          Gives a subsequence of this sentence, in the form of a list of words.
 String toPOString()
          Gives a string with only the part-of-speech tags.
 String toStringChunk()
          Gives a shallow parsed representation of this chunked sentence.
 String toStringRegex()
          Gives another shallow parsed representation of this sentence, in a format suitable for regular expression matching.
 String toStringRegexPOS()
          A toString()-style method that gives a string representation of this chunked sentence, where each word is printed followed by its part-of-speech tag, as shown in the next example: the/dt lazy/jj fox/nn jumped/vbd over/in the/dt fence/nn (26, April 2009, 10:47)
 String toStringRegexPOSCHK()
          This method is similar to toStringRegexPOS(), differing only in that the chunk tag is also printed for each word, after the part-of-speech tag.
 
Methods inherited from class hultig.sumo.Sentence
addWord, codify, codify, compareTo, countIntersectLinks, countMatch, countMatchNGram, countNormIntersectLinks, countNotMatch, countNumWords, ctMatchNGram, demoForWebPage, dgauss, dgauss, distlex, dLinear, dParabolic, dsBLEU, dsEntropy, dSin, dsLevenshtein, dsNgram, dsNgram, dsuffixArrays, dsumo, dsumo, dsumoWSize, ensureCodification, equalArrays, fracNumWords, getCodes, getTag, getTags, getWord, getWords, indexOf, indexOf, isCodefied, isPunct, isWord, length, match, mutation, print, print, println, readLinks, readLinks, reload, set, setMetric, similarity, similarity, splitPunct, subcodes, subs, testaMetricas, toLowerCase, toLowerCase, toMWUString, toString, toStringPOS, x201102012359, x201102281055
 
Methods inherited from class java.util.LinkedList
add, add, addAll, addAll, addFirst, addLast, clear, clone, contains, descendingIterator, element, get, getFirst, getLast, indexOf, lastIndexOf, listIterator, offer, offerFirst, offerLast, peek, peekFirst, peekLast, poll, pollFirst, pollLast, pop, push, remove, remove, remove, removeFirst, removeFirstOccurrence, removeLast, removeLastOccurrence, set, size, toArray, toArray
 
Methods inherited from class java.util.AbstractSequentialList
iterator
 
Methods inherited from class java.util.AbstractList
equals, hashCode, listIterator, removeRange
 
Methods inherited from class java.util.AbstractCollection
containsAll, isEmpty, removeAll, retainAll
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
containsAll, equals, hashCode, isEmpty, iterator, listIterator, removeAll, retainAll
 
Methods inherited from interface java.util.Deque
iterator
 

Constructor Detail

ChunkedSentence

public ChunkedSentence()
The default constructor, which invokes several default settings, including the definition of chunk values.


ChunkedSentence

public ChunkedSentence(String s,
                       OpenNLPKit model)
This constructor receives a string and a language model, and creates a chunked sentence. The shallow parser is invoked from the language model object (model).

Parameters:
s - A string representing a textual sentence.
model - The language model, which should already have been adequately loaded/configured.

ChunkedSentence

public ChunkedSentence(Sentence s,
                       OpenNLPKit model)
This constructor receives a sentence and a language model, and creates an instance of a chunked sentence. The shallow parser is invoked from the language model object (model).

Parameters:
s - The sentence for shallow parsing.
model - The language model.
Method Detail

getNumWords

public int getNumWords()
Gives the number of effective words contained in this sentence.

Returns:
The number of words.

getNumChunks

public int getNumChunks()
Gives the number of chunks contained in this sentence.

Returns:
The number of chunks.

getNumChunks

public int getNumChunks(String postag)
Counts the number of chunks of a certain kind (tag).

Parameters:
postag - The chunk tag to be counted, for example "NP", "VP".
Returns:
The number of chunks matching postag.

getChunk

public Chunk getChunk(int index)
Gets the chunk at position index, from this sentence chunk sequence.

Parameters:
index - The chunk index.
Returns:
The chunk at position index. In the "usual" string format a chunk reads, for example: [NP the/DT Pet/NNP passport/NN ]. On error, null will be returned.

getChunks

public Chunk[] getChunks()
Gets an array containing the complete sequence of chunks, from this sentence, one chunk per array position.

Returns:
The sequence of chunks, or null on error.

getChunkMark

public ChunkMark getChunkMark(int index)
Gives the chunk mark (boundaries and tag), for the chunk at position index, in the sequence of sentence chunks.

Parameters:
index - The chunk index.
Returns:
ChunkMark The chunk mark.

getChunkOnPosition

public ChunkMark getChunkOnPosition(int index)
Gives the chunk mark relative to the word at position index, in this sentence.

Parameters:
index - A valid index of a sentence word. It must be greater than zero and less than the number of words in the sentence.
Returns:
Either the corresponding chunk mark, or null in erroneous cases.

getWordChunkMark

public String getWordChunkMark(int index)
Gives the chunk mark for a word at position index, identifying first to which chunk the word belongs.

Parameters:
index - The word sequential index, in the sentence.
Returns:
The chunk tag (e.g. NP, VP), or null on index out of bounds.

getVPOSig

public String[] getVPOSig()
Gives the array of part-of-speech tags, corresponding to the sequence of words in the sentence.

Returns:
The array of part-of-speech tags.

getSPOSig

public String getSPOSig(char chconnect)
Gives a string with the sequence of part-of-speech tags, corresponding to the sequence of words in the sentence.

Parameters:
chconnect - The connection character, between two tags, usually a blank space.
Returns:
The string with part-of-speech sequence, or null on error. For example: "NP VP PP NP VP".
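
The relation between a tag array and its joined string form can be sketched as follows. This is only a minimal illustration, assuming that the string is the array of tags (as given by getVPOSig()) joined by chconnect; the class TagSignature and its join method are hypothetical names, not part of hultig.sumo, and the real implementation may differ.

```java
// Hypothetical helper, NOT part of hultig.sumo: it only illustrates how an
// array of tags could be joined with a connection character.
class TagSignature {

    /** Joins the tags with chconnect between them; returns null for a null array. */
    static String join(String[] tags, char chconnect) {
        if (tags == null) {
            return null; // mirrors the documented "null on error" convention
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tags.length; i++) {
            if (i > 0) {
                sb.append(chconnect); // separator only between two tags
            }
            sb.append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tags = {"DT", "JJ", "NN", "VBD"};
        System.out.println(join(tags, ' ')); // DT JJ NN VBD
        System.out.println(join(tags, '-')); // DT-JJ-NN-VBD
    }
}
```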

getSPOSig

public String getSPOSig()
The same as getSPOSig(char chconnect), with a blank space as the default connection character.

Returns:
The string with the part-of-speech sequence, or null on error.

toPOString

public String toPOString()
Gives a string with only the part-of-speech tags.

Returns:
The POS string.

toStringChunk

public String toStringChunk()
Gives a shallow parsed representation of this chunked sentence. The representation follows a conventional format: CHK1 CHK2 ... CHKn, where CHKi represents the i-th sentence chunk, with the following structure: CHKi = [CT W1/T1, W2/T2, ..., Wn/Tn], where CT represents the chunk tag, and Wj and Tj the j-th chunk word and POS tag. For example:
    [NP The/DT lazy/JJ fox/NN] [VP jumped/VBD] [PP over/IN] [NP the/DT fence/NN]

Returns:
The string representing the shallow parsed sentence.
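
A string in this bracketed format can be read back with a regular expression. The sketch below works only on the text format shown in the example above (tokens separated by blanks, without commas); ChunkStringReader and its methods are hypothetical names introduced for illustration and do not belong to hultig.sumo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of hultig.sumo: it only reads the documented
// "[CT W1/T1 ... Wn/Tn]" text format.
class ChunkStringReader {

    // One bracketed chunk: "[", the chunk tag, the word/tag tokens, then "]".
    private static final Pattern CHUNK = Pattern.compile("\\[(\\S+)\\s+([^\\]]*)\\]");

    /** Returns the chunk tags in order of appearance. */
    static List<String> chunkTags(String shallowParsed) {
        List<String> tags = new ArrayList<>();
        Matcher m = CHUNK.matcher(shallowParsed);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    /** Returns the words (without POS tags) of every chunk with the given chunk tag. */
    static List<String> wordsOf(String shallowParsed, String chunkTag) {
        List<String> words = new ArrayList<>();
        Matcher m = CHUNK.matcher(shallowParsed);
        while (m.find()) {
            if (!m.group(1).equals(chunkTag)) {
                continue;
            }
            for (String token : m.group(2).trim().split("\\s+")) {
                int slash = token.lastIndexOf('/'); // the POS tag follows the last slash
                if (slash > 0) {
                    words.add(token.substring(0, slash));
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "[NP The/DT lazy/JJ fox/NN] [VP jumped/VBD] [PP over/IN] [NP the/DT fence/NN]";
        System.out.println(chunkTags(s));     // [NP, VP, PP, NP]
        System.out.println(wordsOf(s, "NP")); // [The, lazy, fox, the, fence]
    }
}
```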

toStringRegex

public String toStringRegex()
Gives another shallow parsed representation of this sentence, in a format suitable for regular expression matching. The idea was to be able to apply sentence simplification rules expressed through regular expressions (13, February 2009, 11:57). For example:
    np:<the/dt lazy/jj fox/nn>:np  vp:<jumped/vbd>:vp  pp:<over/in>:pp  np:<the/dt fence/nn>:np

Returns:
The string representing the shallow parsed sentence.
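
To give an idea of how a regular expression could operate on this format, here is a minimal sketch. It assumes only the text layout shown in the example above; the class RegexFormatDemo, its methods, and the adjective-dropping rule are hypothetical illustrations, not the simplification rules actually used with this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of hultig.sumo: it works on the documented
// "tag:<w1/t1 ... wn/tn>:tag" text format only.
class RegexFormatDemo {

    // One chunk "tag:<content>:tag"; the backreference \1 forces the closing
    // tag to be equal to the opening one.
    private static final Pattern CHUNK = Pattern.compile("(\\w+):<([^>]*)>:\\1");

    /** Returns the content of every chunk carrying the given chunk tag. */
    static List<String> chunksWithTag(String s, String tag) {
        List<String> found = new ArrayList<>();
        Matcher m = CHUNK.matcher(s);
        while (m.find()) {
            if (m.group(1).equals(tag)) {
                found.add(m.group(2));
            }
        }
        return found;
    }

    /**
     * Illustrative rule only: deletes every word/jj token that is followed by
     * another token. Excluding '<' and '>' keeps the chunk delimiters intact.
     */
    static String dropAdjectives(String s) {
        return s.replaceAll("[^\\s<>]+/jj\\s+", "");
    }

    public static void main(String[] args) {
        String s = "np:<the/dt lazy/jj fox/nn>:np vp:<jumped/vbd>:vp";
        System.out.println(chunksWithTag(s, "np")); // [the/dt lazy/jj fox/nn]
        System.out.println(dropAdjectives(s));      // np:<the/dt fox/nn>:np vp:<jumped/vbd>:vp
    }
}
```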

toStringRegexPOS

public String toStringRegexPOS()
A toString()-style method that gives a string representation of this chunked sentence, where each word is printed followed by its part-of-speech tag, as shown in the next example:
    the/dt lazy/jj fox/nn jumped/vbd over/in the/dt fence/nn
(26, April 2009, 10:47)

Returns:
A string representation of this chunked sentence.

toStringRegexPOSCHK

public String toStringRegexPOSCHK()
This method is similar to toStringRegexPOS(), differing only in that the chunk tag is also printed for each word, after the part-of-speech tag. For example:
    the/dt/np lazy/jj/np fox/nn/np jumped/vbd/vp over/in/pp the/dt/np fence/nn/np
(27, April 2009, 20:10)

Returns:
A string representation of this chunked sentence.
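
Since each token carries its own chunk tag, the chunk grouping can be recovered from this string. The following sketch assumes only the word/pos/chunk token layout shown in the example above; PosChunkTokens and groupByChunk are hypothetical names, not part of hultig.sumo. Note that two adjacent chunks with the same tag cannot be told apart in this format, so the sketch merges them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of hultig.sumo: it regroups "word/pos/chunk"
// tokens into runs of consecutive tokens sharing the same chunk tag.
class PosChunkTokens {

    /** Returns one entry per run, formatted as "tag[word1 word2 ...]". */
    static List<String> groupByChunk(String s) {
        List<String> groups = new ArrayList<>();
        String currentTag = null;
        StringBuilder words = new StringBuilder();
        for (String token : s.trim().split("\\s+")) {
            String[] parts = token.split("/");
            if (parts.length != 3) {
                continue; // skip anything that is not word/pos/chunk
            }
            String word = parts[0];
            String chunkTag = parts[2];
            if (!chunkTag.equals(currentTag)) {
                if (currentTag != null) {
                    groups.add(currentTag + "[" + words + "]"); // close the previous run
                }
                currentTag = chunkTag;
                words.setLength(0);
            }
            if (words.length() > 0) {
                words.append(' ');
            }
            words.append(word);
        }
        if (currentTag != null) {
            groups.add(currentTag + "[" + words + "]"); // close the last run
        }
        return groups;
    }

    public static void main(String[] args) {
        String s = "the/dt/np lazy/jj/np fox/nn/np jumped/vbd/vp over/in/pp the/dt/np fence/nn/np";
        System.out.println(groupByChunk(s)); // [np[the lazy fox], vp[jumped], pp[over], np[the fence]]
    }
}
```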

lexicoSyntacticEntailmentMetric

public double lexicoSyntacticEntailmentMetric(ChunkedSentence hypot)
This function was designed to compute a likelihood value for the "lexico-syntactic entailment" between this sentence (thesis) and the other, entailed sentence (hypothesis). We say that sentence T entails sentence H if we can infer/conclude H by knowing T. This metric was created to work with data from the RTE collections. The calculations are based on lexical and syntactical (shallow parsed sentence) features.

Parameters:
hypot - The sentence that represents the hypothesis.
Returns:
A real value in the [0,1] interval.

printArrayWords

public void printArrayWords()
This method is a default shortcut for printArrayWords(java.lang.String), with label = null.


printArrayWords

public void printArrayWords(String label)
Outputs the sequence of words in this shallow parsed sentence with their corresponding lexico-syntactic codes.

Parameters:
label - A string to be printed before the whole sequence.

subList

public ChunkedSentence subList(int fromIndex,
                               int toIndex)
Gives a subsequence of this sentence, in the form of a list of words.

Specified by:
subList in interface List<Word>
Overrides:
subList in class AbstractList<Word>
Parameters:
fromIndex - The inclusive starting index.
toIndex - The inclusive ending index.
Returns:
A sublist representing a subsequence of words, from this sequence.
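
The documentation above describes toIndex as inclusive, whereas the java.util.List.subList contract treats toIndex as exclusive. The sketch below shows the two conventions side by side on a plain list of strings; it uses only the standard library, SubListBounds is a hypothetical name, and which convention ChunkedSentence.subList actually follows should be checked against the implementation.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo, NOT part of hultig.sumo: it contrasts the standard
// exclusive upper bound with an inclusive one, on a plain list of strings.
class SubListBounds {

    /** Standard java.util.List convention: toIndex is exclusive. */
    static List<String> exclusiveEnd(List<String> words, int fromIndex, int toIndex) {
        return new LinkedList<>(words.subList(fromIndex, toIndex));
    }

    /** Inclusive convention: the element at toIndex is part of the result. */
    static List<String> inclusiveEnd(List<String> words, int fromIndex, int toIndex) {
        return new LinkedList<>(words.subList(fromIndex, toIndex + 1));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "lazy", "fox", "jumped");
        System.out.println(exclusiveEnd(words, 1, 2)); // [lazy]
        System.out.println(inclusiveEnd(words, 1, 2)); // [lazy, fox]
    }
}
```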

getPOStrFixed

public String getPOStrFixed()

main

public static void main(String[] args)
Generally exemplifies the operative features of this class. In order to run the tests contained in this method, a language model (OpenNLP object) must be previously set.

Parameters:
args - No arguments are expected.