Sentence

hultig.sumo
Class Sentence

java.lang.Object
  java.util.AbstractCollection<E>
      java.util.AbstractList<E>
          java.util.AbstractSequentialList<E>
              java.util.LinkedList<Word>
                  hultig.sumo.Sentence

All Implemented Interfaces:: Serializable, Cloneable, Comparable, Iterable<Word>, Collection<Word>, Deque<Word>, List<Word>, Queue<Word>

Direct Known Subclasses:: ChunkedSentence

public class Sentence
extends LinkedList<Word>
implements Comparable

Represents a textual sentence using various schemes or interpretations. For instance, a sentence may be interpreted as a sequence of characters or as a sequence of words, represented by a linked list of words. This class manages these different kinds of sentence representation.

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:: Serialized Form

Field Summary
`int`	`cod` A sentence index, used in news clustering.
`String`	`label` This label defines a sentence meta-tag.
`static String`	`parentise` The set of text delimiters.
`static String`	`pontuacao` The set of punctuation marks.
`protected String`	`stx` Internal string representation of this sentence.

Fields inherited from class java.util.AbstractList
`modCount`

Constructor Summary
`Sentence()` Default constructor.
`Sentence(String s)` Creates a new sentence from a given string.

Method Summary
`void`	`addWord(Word w)` Append a new word to this sentence.
`void`	`codify(CorpusIndex dic)` Codifies this sentence according to a given previously processed dictionary.
`static void`	`codify(Sentence... vs)` Static method to codify a bunch of sentences.
`int`	`compareTo(Object other)`
`static int`	`countIntersectLinks(Sentence sa, Sentence sb)` Counts the number of link intersections existing between the two sentences.
`int`	`countMatch(String regex)`
`static int`	`countMatchNGram(int N, Sentence sa, Sentence sb)` Counts the number of exclusive n-gram matches, between two sentences.
`static double`	`countNormIntersectLinks(Sentence sa, Sentence sb)` Percentage of link intersections existing between the two sentences.
`int`	`countNotMatch(String regex)` Counts the number of words from this sentence that do not match a given regular expression.
`int`	`countNumWords()` Counts the number of words in this sentence.
`static int`	`ctMatchNGram(int N, Sentence sa, Sentence sb)` Counts the number of n-gram matches between two sentences.
`static void`	`demoForWebPage()`
`double`	`dgauss(Sentence other)` A simple version of the Gaussian similarity between two sentences.
`double`	`dgauss(Sentence other, double p0, double r0, double sp0, double sr0)` The gaussian similarity between two sentences.
`double`	`distlex(Word w)` The minimum lexical distance of a word to any word in this sentence.
`double`	`dLinear(Sentence other)` The linear similarity metric between two sentences.
`double`	`dParabolic(Sentence other)` The parabolic sentence similarity metric.
`static double`	`dsBLEU(Sentence sa, Sentence sb)` Computes the BLEU metric between two sentences.
`double`	`dsEntropy(Sentence other)` The "entropy metric" for calculating the similarity between two sentences.
`double`	`dSin(Sentence other)` The trigonometric function for calculating the similarity between two sentences.
`int`	`dsLevenshtein(Sentence other)` This method applies the Edit Distance (ED) metric to compare this sentence with another one.
`static double`	`dsNgram(int N, Sentence sa, Sentence sb)` Computes the simple n-gram overlap between two sentences, considering a maximum number of n-grams.
`static double`	`dsNgram(Sentence sa, Sentence sb)` Computes the simple n-gram overlap between two sentences, with 4 as the maximum n-gram counted.
`static double`	`dsuffixArrays(Sentence sa, Sentence sb)` A metric for calculating sentence proximity, based on suffix arrays comparisons of n-grams, as defined by Church and Yamamoto.
`static double`	`dsumo(int[] u, int[] v)` The "sumo metric" for calculating the similarity between two sentences.
`double`	`dsumo(Sentence other)` The "sumo metric" for calculating the similarity between two sentences.
`double`	`dsumoWSize(Sentence other)` A different version of the `sumo` function for calculating the similarity between two sentences.
`static void`	`ensureCodification(Sentence... sentences)` Ensures that a given set of sentences is codified, which means that their words have been marked with a word indexer (a `CorpusIndex` object).
`static boolean`	`equalArrays(int[] u, int[] v)` Verifies whether two arrays are equal.
`double`	`fracNumWords()` The proportion of effective words contained in this sentence.
`int[]`	`getCodes()` Gives the array of lexical codes representing this sentence.
`String`	`getTag(int index)` The POS tag, if defined, for a given word.
`String[]`	`getTags()` Gives the array of POS tags for that sentence, assuming it was already tagged.
`String`	`getWord(int index)` Gives the sentence word positioned at a given index.
`String[]`	`getWords()` Gives an array of strings, containing all the words in this sentence.
`int`	`indexOf(String s)` Gives the index of a string in this sentence.
`int`	`indexOf(String s, int from)` Gives the index of a string occurrence within this sentence, starting the search from a given position.
`boolean`	`isCodefied()` Verifies whether this sentence has been marked with a CorpusIndex object.
`static boolean`	`isPunct(String s)` Tests if a given string is a punctuation mark.
`static boolean`	`isWord(String s)` Tests if a given string is a word.
`int`	`length()` This sentence string length.
`static void`	`main(String[] args)` The main method contains a general class tester.
`static int`	`match(int[] vsub, int[] v)` Counts the number of occurrences of a sub-array inside another, presumably longer, array.
`Sentence`	`mutation(int n)` Produces a given number of random "mutations" in this sentence.
`void`	`print()` Outputs the string representing this sentence.
`void`	`print(int a, int b)` Outputs the words of this sentence, between two positions.
`void`	`println()` Outputs all words from this sentence, one word per line.
`static int[][]`	`readLinks(int[] va, int[] vb)` Returns the set of links between two sentences: A = {a1,a2,...an} where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence.
`int[][]`	`readLinks(Sentence other)` Returns the set of links between two sentences: A = {a1,a2,...an} where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence.
`void`	`reload(Sentence s)` Recreate this sentence from another one.
`void`	`set(String s)` Recreate this sentence from a given string.
`void`	`setMetric(String smetric)` Defines which should be the default similarity function to be used in the sentence similarity computation.
`double`	`similarity(Sentence other)` Compute the similarity metric between two sentences.
`double`	`similarity(Sentence other, String metric)` Calculates the similarity between two sentences using a given similarity function.
`Sentence[]`	`splitPunct()` Split a sentence based on the punctuations found.
`int[]`	`subcodes(int start, int end)` Gives the array of sub-codes corresponding to a sub-sentence of this sentence.
`Sentence`	`subs(int a, int b)` Gives a sub-sentence from this sentence, between positions a and b, which should be valid.
`static void`	`testaMetricas(String s1, String s2)`
`void`	`toLowerCase()` Converts every word to lower case and transforms their `CorpusIndex` codes to -1.
`void`	`toLowerCase(CorpusIndex dic)` Converts every word to lower case and redefines each word's lexical code, based on a supplied dictionary.
`String`	`toMWUString()` Transform this sentence into a kind of a multi-word-unit (MWU) expression.
`String`	`toString()` The overriding of the toString() method.
`String`	`toStringPOS()` A toString() type method giving each word joined with its respective part-of-speech tag.
`static void`	`x201102012359()`
`static void`	`x201102281055()` Corrections following the exhaustive tests carried out by Steven Burrows.

Methods inherited from class java.util.LinkedList
`add, add, addAll, addAll, addFirst, addLast, clear, clone, contains, descendingIterator, element, get, getFirst, getLast, indexOf, lastIndexOf, listIterator, offer, offerFirst, offerLast, peek, peekFirst, peekLast, poll, pollFirst, pollLast, pop, push, remove, remove, remove, removeFirst, removeFirstOccurrence, removeLast, removeLastOccurrence, set, size, toArray, toArray`

Methods inherited from class java.util.AbstractSequentialList
`iterator`

Methods inherited from class java.util.AbstractList
`equals, hashCode, listIterator, removeRange, subList`

Methods inherited from class java.util.AbstractCollection
`containsAll, isEmpty, removeAll, retainAll`

Methods inherited from class java.lang.Object
`finalize, getClass, notify, notifyAll, wait, wait, wait`

Methods inherited from interface java.util.List
`containsAll, equals, hashCode, isEmpty, iterator, listIterator, removeAll, retainAll, subList`

Methods inherited from interface java.util.Deque
`iterator`

Field Detail

pontuacao

public static String pontuacao

The set of punctuation marks.

parentise

public static String parentise

The set of text delimiters.

stx

protected String stx

Internal string representation of this sentence.

label

public String label

This label defines a sentence meta-tag.

cod

public int cod

A sentence index, used in news clustering.

Since:: 2008-06-05

Constructor Detail

Sentence

public Sentence()

Default constructor.

Sentence

public Sentence(String s)

Creates a new sentence from a given string.

Parameters:: s - The string containing a sentence.

Method Detail

compareTo

public int compareTo(Object other)

Specified by:: compareTo in interface Comparable

reload

public void reload(Sentence s)

Recreate this sentence from another one.

Parameters:: s - The other sentence.

addWord

public void addWord(Word w)

Append a new word to this sentence.

Parameters:: w - The word to be appended

set

public void set(String s)

Recreate this sentence from a given string.

Parameters:: s - The indicated string.

codify

public void codify(CorpusIndex dic)

Codifies this sentence according to a given previously processed dictionary.

Parameters:: dic - The indicated dictionary.

codify

public static void codify(Sentence... vs)

Static method to codify a bunch of sentences.

Parameters:: vs - Sentence[]

length

public int length()

This sentence string length.

Returns:: The length value.

getWord

public String getWord(int index)

Gives the sentence word positioned at a given index.

Parameters:: index - The index to read from.
Returns:: The word read in the string form.

getWords

public String[] getWords()

Gives an array of strings, containing all the words in this sentence.

Returns:: The array of words.

getTag

public String getTag(int index)

The POS tag, if defined, for a given word.

Parameters:: index - The word position in the sentence.
Returns:: The POS tag read or else the null value.

getCodes

public int[] getCodes()

Gives the array of lexical codes representing this sentence. It assumes that the sentence was already codified.

Returns:: The array of lexical codes.

getTags

public String[] getTags()

Gives the array of POS tags for that sentence, assuming it was already tagged.

Returns:: The array of tags.

isCodefied

public boolean isCodefied()

Verifies whether this sentence has been marked with a CorpusIndex object.

Returns:: The true value on success.

isPunct

public static boolean isPunct(String s)

Tests if a given string is a punctuation mark.

Parameters:: s - The string to be tested.
Returns:: The test result.

isWord

public static boolean isWord(String s)

Test if a given string is a word.

Parameters:: s - The string to be tested
Returns:: True if the string is a word.

countNumWords

public int countNumWords()

Counts the number of words in this sentence.

Returns:: The number of effective words found.

indexOf

public int indexOf(String s)

Gives the index of a string in this sentence. The input string is compared with each word of the sentence, and the position of the first occurrence is returned.

Parameters:: s - The string to be scanned for in this sentence.
Returns:: The index found, or else the -1 value will be returned.

indexOf

public int indexOf(String s,
                   int from)

Gives the index of a string occurrence within this sentence, starting the search from a given position.

Parameters:: s - The string to be scanned for in this sentence.; from - The starting index.
Returns:: The index found, or else the -1 value will be returned.
See Also:: indexOf(String s)
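
The search described above can be pictured with a small sketch over a plain array of words. The class `IndexOfSketch` is hypothetical and is not part of `hultig.sumo`; exact, case-sensitive comparison and the clamping of a negative start index are assumptions.

```java
// Hypothetical illustration; not the library's implementation.
class IndexOfSketch {

    // Returns the first index >= from whose word equals s, or -1 if there is none.
    static int indexOf(String[] words, String s, int from) {
        for (int i = Math.max(from, 0); i < words.length; i++) {
            if (words[i].equals(s)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] words = "the cat saw the dog".split(" ");
        System.out.println(indexOf(words, "the", 0));
        System.out.println(indexOf(words, "the", 1));
        System.out.println(indexOf(words, "bird", 0));
    }
}
```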

subs

public Sentence subs(int a,
                     int b)

Gives a sub-sentence from this sentence, between positions a and b, which should be valid. We can have a < b or b < a. It is only required that 0 < a, b < "sentence length".

Parameters:: a - One index.; b - The other index.
Returns:: The sub-sentence

splitPunct

public Sentence[] splitPunct()

Split a sentence based on the punctuations found.

Returns:: The array of sentences obtained.
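
As a rough picture of punctuation-based splitting, the sketch below cuts a token array wherever a punctuation token appears. Everything here is an assumption made for illustration: the class `SplitPunctSketch`, the punctuation set (the actual contents of `pontuacao` are not listed on this page), and the choice to drop the punctuation tokens from the resulting pieces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the library's implementation.
class SplitPunctSketch {

    // Assumed punctuation set, chosen only for this example.
    static final String PUNCT = ".,;:!?";

    static boolean isPunct(String s) {
        return s.length() == 1 && PUNCT.indexOf(s.charAt(0)) >= 0;
    }

    // Splits the tokens into runs of non-punctuation tokens; empty runs are skipped.
    static List<List<String>> split(String[] tokens) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String t : tokens) {
            if (isPunct(t)) {
                if (!current.isEmpty()) {
                    parts.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(t);
            }
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] tokens = {"Hello", ",", "big", "world", "!", "Bye"};
        System.out.println(split(tokens));
    }
}
```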

dsLevenshtein

public int dsLevenshtein(Sentence other)

This method applies the Edit Distance (ED) metric to compare this sentence with another one. The basic unit of comparison is the word, not the character as in conventional ED.

Parameters:: other - The other sentence.
Returns:: The calculated distance.
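
A minimal sketch of edit distance computed over words rather than characters, assuming unit costs for insertion, deletion, and substitution. The class `WordEditDistance` is a hypothetical name and works on plain string arrays, not on `Sentence` objects; the costs actually used by `dsLevenshtein` are not stated on this page.

```java
// Hypothetical illustration; not the library's implementation.
class WordEditDistance {

    // Dynamic-programming edit distance where each token is one word.
    // Only two rows of the table are kept at a time.
    static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty prefix takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int replace = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), replace);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        String[] s1 = "the big cat sat".split(" ");
        String[] s2 = "the cat sat down".split(" ");
        System.out.println(distance(s1, s2));
    }
}
```

Comparing whole words means that two sentences differing by one long word are at distance 1, whereas a character-level distance would grow with the length of that word.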

subcodes

public int[] subcodes(int start,
                      int end)

Gives the array of sub-codes corresponding to a sub-sentence of this sentence. It assumes that the sentence has already been codified somehow, for example through a previously computed dictionary ("CorpusIndex").

Parameters:: start -; end -
Returns:: The sub-array of codes.

match

public static int match(int[] vsub,
                        int[] v)

Counts the number of occurrences of a sub-array inside another, presumably longer, array. This method is used for simple n-gram match counting.

Parameters:: vsub - The sub-array.; v - The longer array.
Returns:: The number of occurrences.
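
One way to read this counting is sketched below: slide the sub-array over the longer array and count the positions where every element agrees. The class `SubArrayMatch` is hypothetical, and counting overlapping occurrences separately is an assumption, since the page does not say how overlaps are treated.

```java
// Hypothetical illustration; not the library's implementation.
class SubArrayMatch {

    // Counts the start positions in v at which vsub occurs as a contiguous run.
    // Overlapping occurrences are each counted.
    static int match(int[] vsub, int[] v) {
        if (vsub.length == 0 || vsub.length > v.length) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + vsub.length <= v.length; i++) {
            boolean same = true;
            for (int j = 0; j < vsub.length && same; j++) {
                same = v[i + j] == vsub[j];
            }
            if (same) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] bigram = {1, 2};
        int[] codes = {1, 2, 3, 1, 2};
        System.out.println(match(bigram, codes));
    }
}
```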

equalArrays

public static boolean equalArrays(int[] u,
                                  int[] v)

Verifies whether two arrays are equal.

Parameters:: u - The first array.; v - The second array.
Returns:: The test result

ctMatchNGram

public static int ctMatchNGram(int N,
                               Sentence sa,
                               Sentence sb)

Counts the number of n-gram matches between two sentences.

Parameters:: N - The n-gram size.; sa - The first sentence.; sb - The second sentence.
Returns:: The number of n-gram matches.

countMatchNGram

public static int countMatchNGram(int N,
                                  Sentence sa,
                                  Sentence sb)

Counts the number of exclusive n-gram matches, between two sentences.

Parameters:: N - The n-gram size.; sa - The first sentence.; sb - The other sentence.
Returns:: The number of matches.

readLinks

public int[][] readLinks(Sentence other)

Returns the set of links between two sentences: A = {a1,a2,...an} where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence.

Parameters:: other - The other sentence.
Returns:: An array of integer pairs representing the links.
Since:: 2007-03-23

readLinks

public static int[][] readLinks(int[] va,
                                int[] vb)

Parameters:: va - The first sentence array.; vb - The second sentence array.
Returns:: An array of integer pairs representing the links.
Since:: 2007-03-23
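
To make the shape of the returned array concrete, the sketch below builds integer pairs from two code arrays. It assumes that a link joins two positions holding the same lexical code; the page does not define the linking rule (the library may, for example, restrict links to exclusive one-to-one matches), and the class `LinkSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the library's implementation.
class LinkSketch {

    // Returns every pair (i, j) such that va[i] == vb[j], ordered by i and then by j.
    static int[][] links(int[] va, int[] vb) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < va.length; i++) {
            for (int j = 0; j < vb.length; j++) {
                if (va[i] == vb[j]) {
                    out.add(new int[] {i, j});
                }
            }
        }
        return out.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        int[] va = {5, 7, 9};
        int[] vb = {9, 5};
        for (int[] link : links(va, vb)) {
            System.out.println("(" + link[0] + ", " + link[1] + ")");
        }
    }
}
```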

countMatch

public int countMatch(String regex)

countNotMatch

public int countNotMatch(String regex)

Counts the number of words from this sentence that do not match a given regular expression.

Parameters:: regex - The indicated regular expression.
Returns:: The number of condition matches.

countIntersectLinks

public static int countIntersectLinks(Sentence sa,
                                      Sentence sb)

Counts the number of link intersections existing between the two sentences. This method is fundamental for making the runtime decision of choosing the alignment algorithm: Smith-Waterman or Needleman-Wunsch. This is better explained in:

Cordeiro, J.P., Dias, G., Cleuziou, G. (2007). Biology Based Alignments of Paraphrases for Sentence Compression. In Proceedings of the Workshop on Textual Entailment and Paraphrasing (ACL-PASCAL / ACL2007). Prague, Czech Republic.

Parameters:: sa - The first sentence.; sb - The second sentence.
Returns:: The number of link intersections.
Since:: 2007-03-23
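
A sketch of how intersections might be counted, assuming that two links intersect when they cross, that is, when their order in the first sentence is the reverse of their order in the second. This reading of "intersection" and the class `LinkCrossings` are assumptions for illustration; the cited paper is the authoritative description.

```java
// Hypothetical illustration; not the library's implementation.
class LinkCrossings {

    // Counts unordered pairs of links (i1, j1), (i2, j2) with (i1 - i2) * (j1 - j2) < 0.
    static int countCrossings(int[][] links) {
        int crossings = 0;
        for (int p = 0; p < links.length; p++) {
            for (int q = p + 1; q < links.length; q++) {
                long di = (long) links[p][0] - links[q][0];
                long dj = (long) links[p][1] - links[q][1];
                if (di * dj < 0) {
                    crossings++;
                }
            }
        }
        return crossings;
    }

    public static void main(String[] args) {
        int[][] parallel = {{0, 0}, {1, 1}, {2, 2}};
        int[][] reversed = {{0, 2}, {1, 1}, {2, 0}};
        System.out.println(countCrossings(parallel));
        System.out.println(countCrossings(reversed));
    }
}
```

Under this reading, a pair of sentences whose shared words appear in the same order produces no crossings, while heavy reordering produces many, which is the kind of signal a choice between a local and a global alignment could be based on.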

countNormIntersectLinks

public static double countNormIntersectLinks(Sentence sa,
                                             Sentence sb)

Percentage of link intersections existing between the two sentences. This method is related to the "countIntersectLinks(Sentence sa, Sentence sb)" method. The only difference is normalization.

Parameters:: sa - The first sentence; sb - The second sentence
Returns:: A value in the [0,1] interval.
Since:: 2007-03-23

dsBLEU

public static double dsBLEU(Sentence sa,
                            Sentence sb)

Computes the BLEU metric between two sentences. A sentence proximity value.

Parameters:: sa - The first sentence.; sb - The second sentence.
Returns:: A value in the [0,1] interval.

dsNgram

public static double dsNgram(Sentence sa,
                             Sentence sb)

Computes the simple n-gram overlap between two sentences, with 4 as the maximum n-gram counted.

Parameters:: sa - The first sentence.; sb - The second sentence.
Returns:: A value in the [0,1] interval.

dsNgram

public static double dsNgram(int N,
                             Sentence sa,
                             Sentence sb)

Computes the simple n-gram overlap between two sentences, considering a maximum number of n-grams.

Parameters:: N - The maximum number of n-grams.; sa - The first sentence.; sb - The second sentence.
Returns:: A value in the [0,1] interval.
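
The exact formula behind this method is not given on this page. The sketch below shows one common form of simple n-gram overlap under stated assumptions: for each n from 1 up to the maximum, the fraction of distinct n-grams of the shorter sentence that also occur in the longer one is computed, and these fractions are averaged over the values of n for which the shorter sentence has at least one n-gram. The class `NgramOverlap` and both of its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not the library's implementation.
class NgramOverlap {

    // Collects the distinct word n-grams of w, each joined with single spaces.
    static Set<String> ngrams(String[] w, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= w.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(w, i, i + n)));
        }
        return out;
    }

    // Averages, over n = 1..maxN, the share of the shorter sentence's n-grams found in the longer one.
    static double overlap(int maxN, String[] a, String[] b) {
        String[] shorter = a.length <= b.length ? a : b;
        String[] longer = a.length <= b.length ? b : a;
        double sum = 0.0;
        int used = 0;
        for (int n = 1; n <= maxN; n++) {
            Set<String> fromShorter = ngrams(shorter, n);
            if (fromShorter.isEmpty()) {
                break; // the shorter sentence has fewer than n words
            }
            int total = fromShorter.size();
            fromShorter.retainAll(ngrams(longer, n));
            sum += (double) fromShorter.size() / total;
            used++;
        }
        return used == 0 ? 0.0 : sum / used;
    }

    public static void main(String[] args) {
        String[] a = "the big cat".split(" ");
        String[] b = "the big dog".split(" ");
        System.out.println(overlap(4, a, b));
    }
}
```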

dsuffixArrays

public static double dsuffixArrays(Sentence sa,
                                   Sentence sb)

A metric for calculating sentence proximity, based on suffix arrays comparisons of n-grams, as defined by Church and Yamamoto.

Parameters:: sa - The first sentence.; sb - The second sentence.
Returns:: A value in the [0,1] interval.
Since:: 2006-03-28

dsumo

public double dsumo(Sentence other)

The "sumo metric" for calculating the similarity between two sentences. As presented in:

Cordeiro, J.P., Dias, G., Brazdil, P. (2007). Learning Paraphrases from WNS Corpora. 20th International FLAIRS Conference. AAAI Press. Key West, Florida, USA.

Parameters:: other - The other sentence to compare with.
Returns:: A value in the [0,1] interval.
Since:: 2006-03-25

dsumo

public static double dsumo(int[] u,
                           int[] v)

The "sumo metric" for calculating the similarity between two sentences, represented by their arrays of codes. As presented in:

Cordeiro, J.P., Dias, G., Brazdil, P. (2007). Learning Paraphrases from WNS Corpora. 20th International FLAIRS Conference. AAAI Press. Key West, Florida, USA.

Parameters:: u - The first array of codes.; v - The second array of codes.
Returns:: The similarity value in the [0,1] interval.
Since:: 2006-05-04

dsumoWSize

public double dsumoWSize(Sentence other)

A different version of the sumo function for calculating the similarity between two sentences. The main difference lies in how the exclusive lexical links between the two sentences are counted: the "weight" of each link depends directly on the sizes of the connected words.

Parameters:: other - The other sentence to compare with.
Returns:: A value in the [0,1] interval.

dsEntropy

public double dsEntropy(Sentence other)

The "entropy metric" for calculating the similarity between two sentences.

Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume: 2, Issue: 4, Page(s): 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

Date: 2007-06-18

Parameters:: other - The other sentence to compare with.
Returns:: A value in the [0,1] interval.

dgauss

public double dgauss(Sentence other)

A simple version of the Gaussian similarity between two sentences. In this particular case the Gaussian parameters are as follows: a=1, b=0.5, and c=0.3.

Parameters:: other - The other sentence.
Returns:: A value in the [y0,1] interval, where y0 corresponds to x = 0.0 (no overlapping), which means that y0 = exp{-0.5^2/(2*0.3^2)} ≈ 0.24935
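
The lower bound y0 can be checked by evaluating the general Gaussian a * exp(-(x - b)^2 / (2 * c^2)) with the parameters quoted above. The class `GaussianSketch` is hypothetical; how `dgauss` derives x from the two sentences is not described on this page and is not modeled here.

```java
// Hypothetical illustration; not the library's implementation.
class GaussianSketch {

    // General Gaussian curve with height a, center b, and width c.
    static double gaussian(double x, double a, double b, double c) {
        double d = x - b;
        return a * Math.exp(-(d * d) / (2.0 * c * c));
    }

    public static void main(String[] args) {
        // Value at x = 0.0 (no overlap) for a = 1, b = 0.5, c = 0.3.
        System.out.println(gaussian(0.0, 1.0, 0.5, 0.3));
        // Value at the center x = b, where the curve reaches its maximum a.
        System.out.println(gaussian(0.5, 1.0, 0.5, 0.3));
    }
}
```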

dgauss

public double dgauss(Sentence other,
                     double p0,
                     double r0,
                     double sp0,
                     double sr0)

The Gaussian similarity between two sentences. Like the Gaussian function family, it depends on four parameters, which here have the meanings listed below. This function was also presented in the following article:

Parameters:: other - The other sentence.; p0 - The expected precision of sentence token matching.; r0 - The expected recall of sentence token matching.; sp0 - The expected precision variance.; sr0 - The expected recall variance.
Returns:: A value in the [0,1] interval.

dParabolic

public double dParabolic(Sentence other)

The parabolic sentence similarity metric. As presented in the following article:

Parameters:: other - The other sentence.
Returns:: A value in the [0,1] interval.
Since:: 2007-06-18

dLinear

public double dLinear(Sentence other)

The linear similarity metric between two sentences. It is based on the triangular function in the [0,1] interval taking as arguments the precision and recall of sentence token overlapping between the two sentences.

Parameters:: other - The other sentence
Returns:: A value in the [0,1] interval.
Since:: 2007-06-18

dSin

public double dSin(Sentence other)

The trigonometric function for calculating the similarity between two sentences. It is based on the sine function. Presented in the following article:

Parameters:: other -
Returns:: A value in the [0,1] interval.

ensureCodification

public static void ensureCodification(Sentence... sentences)

Ensures that a given set of sentences is codified, which means that their words have been marked with a word indexer (a CorpusIndex object). If not, the set of sentences will be marked with a new and specific word indexer, constructed only from the set of sentences received as parameter.

Parameters:: sentences - The set of sentences.

distlex

public double distlex(Word w)

The minimum lexical distance of a word to any word in this sentence.

Parameters:: w - The input word.
Returns:: The minimum distance.

fracNumWords

public double fracNumWords()

The proportion of effective words contained in this sentence.

Returns:: A value in the [0,1] interval.

setMetric

public void setMetric(String smetric)

Defines which should be the default similarity function to be used in the sentence similarity computation.

Parameters:: smetric - Contains the name of the similarity function. The possible values are: ngram, xgram, bleu, edit, entropy, or sumo.
See Also:: The defined metric codes.

similarity

public double similarity(Sentence other)

Compute the similarity metric between two sentences. The "sumo metric" is the default similarity function used.

Parameters:: other - The other sentence.
Returns:: The similarity value [0,1], according to some specified metric.
See Also:: The defined metric codes.

similarity

public double similarity(Sentence other,
                         String metric)

Calculates the similarity between two sentences using a given similarity function.

Parameters:: other - The other sentence.; metric - The name of the similarity function.
Returns:: A value in the [0,1] interval.
Since:: 2009-11-17

print

public void print(int a,
                  int b)

Outputs the words of this sentence, between two positions.

Parameters:: a - The first position.; b - The second position.

print

public void print()

Outputs the string representing this sentence.

println

public void println()

Outputs all words from this sentence, one word per line.

toLowerCase

public void toLowerCase()

Converts every word to lower case and transforms their CorpusIndex codes to -1. Thus, any lexical codification will be eliminated.

toLowerCase

public void toLowerCase(CorpusIndex dic)

Converts every word to lower case and redefines each word's lexical code, based on a supplied dictionary.

Parameters:: dic - The dictionary.

toString

public String toString()

The overriding of the toString() method.

Overrides:: toString in class AbstractCollection<Word>

Returns:: A string representing this sentence.

toStringPOS

public String toStringPOS()

A toString() type method giving each word joined with its respective part-of-speech tag.

Returns:: String

toMWUString

public String toMWUString()

Transform this sentence into a kind of multi-word-unit (MWU) expression. Each word will be connected to its neighbors through underscores. For example, the sentence "The big cat" will give rise to "The_big_cat".

Returns:: The multi-word-unit.
Since:: 2010-02-12 (Created for the work with Gintare).
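
The transformation in the example above amounts to joining the words with underscores. A minimal sketch on a plain string array follows; the class `MwuSketch` is hypothetical, and how the real method treats punctuation tokens is not stated here.

```java
// Hypothetical illustration; not the library's implementation.
class MwuSketch {

    // Joins the words into a single token, separating neighbors with underscores.
    static String toMwu(String[] words) {
        return String.join("_", words);
    }

    public static void main(String[] args) {
        System.out.println(toMwu(new String[] {"The", "big", "cat"})); // prints The_big_cat
    }
}
```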

mutation

public Sentence mutation(int n)

Produces a given number of random "mutations" in this sentence. This method was used in several early paraphrase detection experiments. A "mutation" consists in transforming a sentence word into a constant of mutation ("XMUT") token.

Parameters:: n - The maximum and likely number of mutations.
Returns:: A mutated sentence.
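
One plausible reading of "maximum and likely number" is that positions are drawn at random with replacement, so the same word can be hit twice and fewer than n distinct words end up mutated. The sketch below follows that reading as an assumption; the class `MutationSketch`, the use of `java.util.Random`, and the choice to leave the input array untouched are all illustrative.

```java
import java.util.Random;

// Hypothetical illustration; not the library's implementation.
class MutationSketch {

    static final String XMUT = "XMUT";

    // Returns a copy of words in which up to n randomly chosen positions hold the XMUT token.
    static String[] mutate(String[] words, int n, Random rnd) {
        String[] out = words.clone();
        for (int k = 0; k < n && out.length > 0; k++) {
            out[rnd.nextInt(out.length)] = XMUT; // a position may be drawn more than once
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = "the big cat sat on the mat".split(" ");
        String[] mutated = mutate(words, 2, new Random(42));
        System.out.println(String.join(" ", mutated));
    }
}
```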

x201102012359

public static void x201102012359()

x201102281055

public static void x201102281055()

Corrections following the exhaustive tests carried out by Steven Burrows.

testaMetricas

public static void testaMetricas(String s1,
                                 String s2)

demoForWebPage

public static void demoForWebPage()

main

public static void main(String[] args)

The main method contains a general class tester.

Parameters:: args -
