hultig.sumo
Class Sentence

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractList<E>
          extended by java.util.AbstractSequentialList<E>
              extended by java.util.LinkedList<Word>
                  extended by hultig.sumo.Sentence
All Implemented Interfaces:
Serializable, Cloneable, Comparable, Iterable<Word>, Collection<Word>, Deque<Word>, List<Word>, Queue<Word>
Direct Known Subclasses:
ChunkedSentence

public class Sentence
extends LinkedList<Word>
implements Comparable

Represents a textual sentence using various schemes or interpretations. For instance, a sentence may be interpreted as a sequence of characters or as a sequence of words, represented by a linked list of words. This class manages the different kinds of sentence representations.

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:
Serialized Form

Field Summary
 int cod
          A sentence index, used in news clustering.
 String label
          This label defines a sentence meta-tag.
static String parentise
          The set of text delimiters.
static String pontuacao
          The set of punctuation marks.
protected  String stx
          Internal string representation of this sentence.
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
Sentence()
          Default constructor.
Sentence(String s)
          Creates a new sentence from a given string.
 
Method Summary
 void addWord(Word w)
          Append a new word to this sentence.
 void codify(CorpusIndex dic)
          Codifies this sentence according to a given previously processed dictionary.
static void codify(Sentence... vs)
          Static method to codify a bunch of sentences.
 int compareTo(Object other)
           
static int countIntersectLinks(Sentence sa, Sentence sb)
          Counts the number of link intersections existing between the two sentences.
 int countMatch(String regex)
           
static int countMatchNGram(int N, Sentence sa, Sentence sb)
          Counts the number of exclusive n-gram matches, between two sentences.
static double countNormIntersectLinks(Sentence sa, Sentence sb)
          Percentage of link intersections existing between the two sentences.
 int countNotMatch(String regex)
          Counts the number of words from this sentence that do not match a given regular expression.
 int countNumWords()
          Counts the number of words in this sentence.
static int ctMatchNGram(int N, Sentence sa, Sentence sb)
          Counts the number of n-gram matches between two sentences.
static void demoForWebPage()
           
 double dgauss(Sentence other)
          A simple version of the Gaussian similarity between two sentences.
 double dgauss(Sentence other, double p0, double r0, double sp0, double sr0)
          The Gaussian similarity between two sentences.
 double distlex(Word w)
          The minimum lexical distance of a word to any word in this sentence.
 double dLinear(Sentence other)
          The linear similarity metric between two sentences.
 double dParabolic(Sentence other)
          The parabolic sentence similarity metric.
static double dsBLEU(Sentence sa, Sentence sb)
          Computes the BLEU metric between two sentences.
 double dsEntropy(Sentence other)
          The "entropy metric" for calculating the similarity between two sentences.
 double dSin(Sentence other)
          The trigonometric function for calculating the similarity between two sentences.
 int dsLevenshtein(Sentence other)
          This method applies the Edit Distance (ED) metric to compare this sentence with another one.
static double dsNgram(int N, Sentence sa, Sentence sb)
          Computes the simple n-gram overlap between two sentences, considering a maximum number of n-grams.
static double dsNgram(Sentence sa, Sentence sb)
          Computes the simple n-gram overlap between two sentences, with 4 as the maximum n-gram counted.
static double dsuffixArrays(Sentence sa, Sentence sb)
          A metric for calculating sentence proximity, based on suffix arrays comparisons of n-grams, as defined by Church and Yamamoto.
static double dsumo(int[] u, int[] v)
          The "sumo metric" for calculating the similarity between two sentences.
 double dsumo(Sentence other)
          The "sumo metric" for calculating the similarity between two sentences.
 double dsumoWSize(Sentence other)
          A different version of the sumo function for calculating the similarity between two sentences.
static void ensureCodification(Sentence... sentences)
          Ensures that a given set of sentences is codified, which means that their words have been marked with a word indexer (a CorpusIndex object).
static boolean equalArrays(int[] u, int[] v)
          Verifies whether two arrays are equal.
 double fracNumWords()
          The proportion of effective words contained in this sentence.
 int[] getCodes()
          Gives the array of lexical codes representing this sentence.
 String getTag(int index)
          The POS tag, if defined, for a given word.
 String[] getTags()
          Gives the array of POS tags for that sentence, assuming it was already tagged.
 String getWord(int index)
          Gives the sentence word positioned at a given index.
 String[] getWords()
          Gives an array of strings, containing all the words in this sentence.
 int indexOf(String s)
          Gives the index of a string in this sentence.
 int indexOf(String s, int from)
          Gives the index of a string occurrence within this sentence, starting the search from a given position.
 boolean isCodefied()
          Verifies whether this sentence has been marked with a CorpusIndex object.
static boolean isPunct(String s)
          Tests if a given string is a punctuation mark.
static boolean isWord(String s)
          Tests if a given string is a word.
 int length()
          The length of this sentence's string representation.
static void main(String[] args)
          The main method contains a general class tester.
static int match(int[] vsub, int[] v)
          Counts the number of occurrences of a sub-array inside another, presumably longer, array.
 Sentence mutation(int n)
          Produces a given number of random "mutations" in this sentence.
 void print()
          Outputs the string representing this sentence.
 void print(int a, int b)
          Outputs the words of this sentence, between two positions.
 void println()
          Outputs all words from this sentence, one word per line.
static int[][] readLinks(int[] va, int[] vb)
          Returns the set of links between two sentences: A = {a1, a2, ..., an}, where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence.
 int[][] readLinks(Sentence other)
          Returns the set of links between two sentences: A = {a1, a2, ..., an}, where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence.
 void reload(Sentence s)
          Recreate this sentence from another one.
 void set(String s)
          Recreate this sentence from a given string.
 void setMetric(String smetric)
          Defines which should be the default similarity function to be used in the sentence similarity computation.
 double similarity(Sentence other)
          Compute the similarity metric between two sentences.
 double similarity(Sentence other, String metric)
          Calculates the similarity between two sentences using a given similarity function.
 Sentence[] splitPunct()
          Split a sentence based on the punctuations found.
 int[] subcodes(int start, int end)
          Gives the array of sub-codes corresponding to a sub-sentence of this sentence.
 Sentence subs(int a, int b)
          Gives a sub-sentence from this sentence, between positions a and b, which should be valid.
static void testaMetricas(String s1, String s2)
           
 void toLowerCase()
          Converts every word to lower case and transforms their CorpusIndex codes to -1.
 void toLowerCase(CorpusIndex dic)
          Converts every word to lower case and redefines each word's lexical code, based on a supplied dictionary.
 String toMWUString()
          Transform this sentence into a kind of a multi-word-unit (MWU) expression.
 String toString()
          The overriding of the toString() method.
 String toStringPOS()
          A toString()-style method giving each word joined with its respective part-of-speech tag.
static void x201102012359()
           
static void x201102281055()
          Corrections following the exhaustive tests carried out by Steven Burrows.
 
Methods inherited from class java.util.LinkedList
add, add, addAll, addAll, addFirst, addLast, clear, clone, contains, descendingIterator, element, get, getFirst, getLast, indexOf, lastIndexOf, listIterator, offer, offerFirst, offerLast, peek, peekFirst, peekLast, poll, pollFirst, pollLast, pop, push, remove, remove, remove, removeFirst, removeFirstOccurrence, removeLast, removeLastOccurrence, set, size, toArray, toArray
 
Methods inherited from class java.util.AbstractSequentialList
iterator
 
Methods inherited from class java.util.AbstractList
equals, hashCode, listIterator, removeRange, subList
 
Methods inherited from class java.util.AbstractCollection
containsAll, isEmpty, removeAll, retainAll
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
containsAll, equals, hashCode, isEmpty, iterator, listIterator, removeAll, retainAll, subList
 
Methods inherited from interface java.util.Deque
iterator
 

Field Detail

pontuacao

public static String pontuacao
The set of punctuation marks.


parentise

public static String parentise
The set of text delimiters.


stx

protected String stx
Internal string representation of this sentence.


label

public String label
This label defines a sentence meta-tag.


cod

public int cod
A sentence index, used in news clustering.

Since:
2008-06-05
Constructor Detail

Sentence

public Sentence()
Default constructor.


Sentence

public Sentence(String s)
Creates a new sentence from a given string.

Parameters:
s - The string containing a sentence.
Method Detail

compareTo

public int compareTo(Object other)
Specified by:
compareTo in interface Comparable

reload

public void reload(Sentence s)
Recreate this sentence from another one.

Parameters:
s - The other sentence.

addWord

public void addWord(Word w)
Append a new word to this sentence.

Parameters:
w - The word to be appended

set

public void set(String s)
Recreate this sentence from a given string.

Parameters:
s - The indicated string.

codify

public void codify(CorpusIndex dic)
Codifies this sentence according to a given previously processed dictionary.

Parameters:
dic - The indicated dictionary.

codify

public static void codify(Sentence... vs)
Static method to codify a bunch of sentences.

Parameters:
vs - The sentences to be codified.

length

public int length()
The length of this sentence's string representation.

Returns:
The length value.

getWord

public String getWord(int index)
Gives the sentence word positioned at a given index.

Parameters:
index - The index to read from.
Returns:
The word read in the string form.

getWords

public String[] getWords()
Gives an array of strings, containing all the words in this sentence.

Returns:
The array of words.

getTag

public String getTag(int index)
The POS tag, if defined, for a given word.

Parameters:
index - The word position in the sentence.
Returns:
The POS tag read or else the null value.

getCodes

public int[] getCodes()
Gives the array of lexical codes representing this sentence. It assumes that the sentence was already codified.

Returns:
The array of lexical codes.

getTags

public String[] getTags()
Gives the array of POS tags for that sentence, assuming it was already tagged.

Returns:
The array of tags.

isCodefied

public boolean isCodefied()
Verifies whether this sentence has been marked with a CorpusIndex object.

Returns:
true if this sentence has been codified; false otherwise.

isPunct

public static boolean isPunct(String s)
Tests if a given string is a punctuation mark.

Parameters:
s - The string to be tested.
Returns:
The test result.

isWord

public static boolean isWord(String s)
Tests if a given string is a word.

Parameters:
s - The string to be tested
Returns:
True if the string is a word.

countNumWords

public int countNumWords()
Counts the number of words in this sentence.

Returns:
The number of effective words found.

indexOf

public int indexOf(String s)
Gives the index of a string in this sentence. The input string is compared with each word of the sentence, and the position of the first occurrence is returned.

Parameters:
s - The string to be searched for in this sentence.
Returns:
The index found, or else the -1 value will be returned.

indexOf

public int indexOf(String s,
                   int from)
Gives the index of a string occurrence within this sentence, starting the search from a given position.

Parameters:
s - The string to be searched for in this sentence.
from - The starting index.
Returns:
The index found, or else the -1 value will be returned.
See Also:
indexOf(String s)

subs

public Sentence subs(int a,
                     int b)
Gives a sub-sentence from this sentence, between positions a and b, which should be valid. Either a < b or b < a is accepted; it is only required that 0 < a, b < sentence length.

Parameters:
a - One index.
b - The other index.
Returns:
The sub-sentence

splitPunct

public Sentence[] splitPunct()
Split a sentence based on the punctuations found.

Returns:
The array of sentences obtained.

dsLevenshtein

public int dsLevenshtein(Sentence other)
This method applies the Edit Distance (ED) metric to compare this sentence with another one. The basic unit of comparison is the word, not the character as in conventional ED.

Parameters:
other - The other sentence.
Returns:
The calculated distance.
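
The word-level variant described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class name WordEditDistance and the use of plain String arrays in place of Sentence objects are assumptions made for illustration, and unit costs for insertion, deletion and substitution are assumed.

```java
final class WordEditDistance {

    /** Levenshtein distance in which the edit unit is a whole word (unit costs assumed). */
    static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // transforming an empty prefix into j words takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // deleting i words yields the empty prefix
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap the two rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        String[] x = "the big cat sat".split(" ");
        String[] y = "the cat sat down".split(" ");
        System.out.println(distance(x, y)); // prints 2: delete "big", insert "down"
    }
}
```

Comparing whole words keeps the distance insensitive to word length: replacing a short word costs the same as replacing a long one.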

subcodes

public int[] subcodes(int start,
                      int end)
Gives the array of sub-codes corresponding to a sub-sentence of this sentence. It assumes that the sentence has already been codified somehow, for example through a previously computed dictionary ("CorpusIndex").

Parameters:
start - The starting position of the sub-sentence.
end - The ending position of the sub-sentence.
Returns:
The sub-array of codes.

match

public static int match(int[] vsub,
                        int[] v)
Counts the number of occurrences of a sub-array inside another, presumably longer, array. This method is used for simple n-gram match counting.

Parameters:
vsub - The sub-array.
v - The longer array.
Returns:
The number of occurrences.
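
As a rough illustration of this kind of counting, the sketch below scans the longer array with a sliding window. The class and method names are hypothetical, and counting overlapping occurrences is an assumption; the documentation does not say how the library treats overlaps.

```java
final class SubArrayMatch {

    /** Counts the positions in v where sub occurs contiguously (overlaps are counted). */
    static int countOccurrences(int[] sub, int[] v) {
        if (sub.length == 0 || sub.length > v.length) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + sub.length <= v.length; i++) {
            int k = 0;
            while (k < sub.length && v[i + k] == sub[k]) {
                k++;
            }
            if (k == sub.length) {
                count++; // the whole window matched
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] bigram = {1, 2};
        int[] codes = {1, 2, 3, 1, 2};
        System.out.println(countOccurrences(bigram, codes)); // prints 2
    }
}
```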

equalArrays

public static boolean equalArrays(int[] u,
                                  int[] v)
Verifies whether two arrays are equal.

Parameters:
u - The first array.
v - The second array.
Returns:
The test result

ctMatchNGram

public static int ctMatchNGram(int N,
                               Sentence sa,
                               Sentence sb)
Counts the number of n-gram matches between two sentences.

Parameters:
N - The n-gram size.
sa - The first sentence.
sb - The second sentence.
Returns:
The number of n-gram matches.

countMatchNGram

public static int countMatchNGram(int N,
                                  Sentence sa,
                                  Sentence sb)
Counts the number of exclusive n-gram matches, between two sentences.

Parameters:
N - The n-gram size.
sa - The first sentence.
sb - The other sentence.
Returns:
The number of matches.

readLinks

public int[][] readLinks(Sentence other)
Returns the set of links between two sentences: A = {a1, a2, ..., an}, where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence.

Parameters:
other - The other sentence.
Returns:
An array of integer pairs representing the links.
Since:
2007-03-23

readLinks

public static int[][] readLinks(int[] va,
                                int[] vb)
Returns the set of links between two sentences: A = {a1, a2, ..., an}, where ak = (k1, k2) is an integer pair representing the link between the word at position k1 in one sentence and the word at position k2 in the other sentence. In this method the sentences are represented by their arrays of codes.

Parameters:
va - The first sentence array.
vb - The second sentence array.
Returns:
An array of integer pairs representing the links.
Since:
2007-03-23
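
To make the shape of the returned array concrete, here is a minimal sketch that builds (k1, k2) pairs from two code arrays. It assumes that a link exists wherever the two codes are equal; the library may apply a stricter rule (for example, one link per word), so the class LinkSketch and its method are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkSketch {

    /** Returns one {k1, k2} pair for every position pair holding the same lexical code. */
    static int[][] links(int[] va, int[] vb) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < va.length; i++) {
            for (int j = 0; j < vb.length; j++) {
                if (va[i] == vb[j]) {
                    out.add(new int[] {i, j}); // word i of va links to word j of vb
                }
            }
        }
        return out.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        int[][] a = links(new int[] {5, 7, 9}, new int[] {9, 5});
        for (int[] link : a) {
            System.out.println("(" + link[0] + ", " + link[1] + ")");
        }
    }
}
```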

countMatch

public int countMatch(String regex)

countNotMatch

public int countNotMatch(String regex)
Counts the number of words from this sentence that do not match a given regular expression.

Parameters:
regex - The indicated regular expression.
Returns:
The number of condition matches.
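
A minimal sketch of this kind of count, using java.util.regex on a plain array of words. The class name is hypothetical, and requiring the regular expression to match the whole word (rather than a substring) is an assumption.

```java
import java.util.regex.Pattern;

final class RegexWordCount {

    /** Counts the words that do not fully match the given regular expression. */
    static int countNotMatch(String[] words, String regex) {
        Pattern p = Pattern.compile(regex);
        int n = 0;
        for (String w : words) {
            if (!p.matcher(w).matches()) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String[] words = {"The", "3", "cats", ","};
        System.out.println(countNotMatch(words, "[A-Za-z]+")); // prints 2 ("3" and ",")
    }
}
```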

countIntersectLinks

public static int countIntersectLinks(Sentence sa,
                                      Sentence sb)
Counts the number of link intersections existing between the two sentences. This method is fundamental for making the runtime decision of choosing the alignment algorithm: Smith-Waterman or Needleman-Wunsch. This is better explained in:

Cordeiro, J.P., Dias, G., Cleuziou, G. (2007). Biology Based Alignments of Paraphrases for Sentence Compression. In Proceedings of the Workshop on Textual Entailment and Paraphrasing (ACL-PASCAL / ACL2007). Prague, Czech Republic.

Parameters:
sa - The first sentence.
sb - The second sentence.
Returns:
The number of link intersections.
Since:
2007-03-23
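
One plausible reading of a "link intersection" is a pair of links that cross when the two sentences are written one above the other. The sketch below counts such crossings from an array of (k1, k2) pairs; whether the library defines an intersection exactly this way is an assumption, and the class name LinkCrossings is hypothetical.

```java
final class LinkCrossings {

    /** Counts pairs of links (i1, j1), (i2, j2) whose order differs on the two sides. */
    static int countCrossings(int[][] links) {
        int n = 0;
        for (int a = 0; a < links.length; a++) {
            for (int b = a + 1; b < links.length; b++) {
                long d1 = (long) links[a][0] - links[b][0];
                long d2 = (long) links[a][1] - links[b][1];
                if (d1 * d2 < 0) {
                    n++; // opposite signs: the two links cross
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[][] crossed = {{0, 1}, {2, 0}};
        int[][] parallel = {{0, 0}, {1, 1}};
        System.out.println(countCrossings(crossed));  // prints 1
        System.out.println(countCrossings(parallel)); // prints 0
    }
}
```

Under this reading, many crossings suggest that matching words appear in a different order in the two sentences.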

countNormIntersectLinks

public static double countNormIntersectLinks(Sentence sa,
                                             Sentence sb)
Percentage of link intersections existing between the two sentences. This method is related to the "countIntersectLinks(Sentence sa, Sentence sb)" method. The only difference is normalization.

Parameters:
sa - The first sentence
sb - The second sentence
Returns:
A value in the [0,1] interval.
Since:
2007-03-23

dsBLEU

public static double dsBLEU(Sentence sa,
                            Sentence sb)
Computes the BLEU metric between two sentences. A sentence proximity value.

Parameters:
sa - The first sentence.
sb - The second sentence.
Returns:
A value in the [0,1] interval.

dsNgram

public static double dsNgram(Sentence sa,
                             Sentence sb)
Computes the simple n-gram overlap between two sentences, with 4 as the maximum n-gram counted.

Parameters:
sa - The first sentence.
sb - The second sentence.
Returns:
A value in the [0,1] interval.

dsNgram

public static double dsNgram(int N,
                             Sentence sa,
                             Sentence sb)
Computes the simple n-gram overlap between two sentences, considering a maximum number of n-grams.

Parameters:
N - The maximum number of n-grams.
sa - The first sentence.
sb - The second sentence.
Returns:
A value in the [0,1] interval.

dsuffixArrays

public static double dsuffixArrays(Sentence sa,
                                   Sentence sb)
A metric for calculating sentence proximity, based on suffix arrays comparisons of n-grams, as defined by Church and Yamamoto.

Parameters:
sa - The first sentence.
sb - The second sentence.
Returns:
A value in the [0,1] interval.
Since:
2006-03-28

dsumo

public double dsumo(Sentence other)
The "sumo metric" for calculating the similarity between two sentences. As presented in:

Cordeiro, J.P., Dias, G., Brazdil, P. (2007). Learning Paraphrases from WNS Corpora. 20th International FLAIRS Conference. AAAI Press. Key West, Florida, USA.

Parameters:
other - The other sentence to compare with.
Returns:
A value in the [0,1] interval.
Since:
2006-03-25

dsumo

public static double dsumo(int[] u,
                           int[] v)
The "sumo metric" for calculating the similarity between two sentences, represented by their arrays of codes. As presented in:

Cordeiro, J.P., Dias, G., Brazdil, P. (2007). Learning Paraphrases from WNS Corpora. 20th International FLAIRS Conference. AAAI Press. Key West, Florida, USA.

Parameters:
u - The first array of codes.
v - The second array of codes.
Returns:
The similarity value in the [0,1] interval.
Since:
2006-05-04

dsumoWSize

public double dsumoWSize(Sentence other)
A different version of the sumo function for calculating the similarity between two sentences. The main difference lies in how the exclusive lexical links between the two sentences are counted: the "weight" of each link depends directly on the sizes of the connected words.

Parameters:
other - The other sentence to compare with.
Returns:
A value in the [0,1] interval.

dsEntropy

public double dsEntropy(Sentence other)
The "entropy metric" for calculating the similarity between two sentences.

Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume: 2, Issue: 4, Page(s): 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

Date: 2007-06-18

Parameters:
other - The other sentence to compare with.
Returns:
A value in the [0,1] interval.

dgauss

public double dgauss(Sentence other)
A simple version of the Gaussian similarity between two sentences. In this particular case the Gaussian parameters are as follows: a=1, b=0.5, and c=0.3.

Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume: 2, Issue: 4, Page(s): 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

Parameters:
other - The other sentence.
Returns:
A value in the [y0,1] interval, where y0 corresponds to x = 0.0 (no overlapping), which means that y0 = exp{-0.5^2/(2*0.3^2)} = 0.29435

dgauss

public double dgauss(Sentence other,
                     double p0,
                     double r0,
                     double sp0,
                     double sr0)
The Gaussian similarity between two sentences. Like the Gaussian function family, it depends on four parameters, which here have the meanings listed below. This function was also presented in the following article:

Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume: 2, Issue: 4, Page(s): 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

Parameters:
other - The other sentence.
p0 - The expected precision of the token match between the sentences.
r0 - The expected recall of the token match between the sentences.
sp0 - The expected precision variance.
sr0 - The expected recall variance.
Returns:
A value in the [0,1] interval.

dParabolic

public double dParabolic(Sentence other)
The parabolic sentence similarity metric. As presented in the following article:

Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume: 2, Issue: 4, Page(s): 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

Parameters:
other - The other sentence.
Returns:
A value in the [0,1] interval.
Since:
2007-06-18

dLinear

public double dLinear(Sentence other)
The linear similarity metric between two sentences. It is based on the triangular function in the [0,1] interval, taking as arguments the precision and recall of the token overlap between the two sentences.

Parameters:
other - The other sentence
Returns:
A value in the [0,1] interval.
Since:
2007-06-18

dSin

public double dSin(Sentence other)
The trigonometric function for calculating the similarity between two sentences. It is based on the sine function. Presented in the following article:

Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume: 2, Issue: 4, Page(s): 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

Parameters:
other - The other sentence.
Returns:
A value in the [0,1] interval.

ensureCodification

public static void ensureCodification(Sentence... sentences)
Ensures that a given set of sentences is codified, which means that their words have been marked with a word indexer (a CorpusIndex object). If not, the set of sentences will be marked with a new and specific word indexer, constructed only from the set of sentences received as parameter.

Parameters:
sentences - The set of sentences to check.

distlex

public double distlex(Word w)
The minimum lexical distance of a word to any word in this sentence.

Parameters:
w - The input word.
Returns:
The minimum distance.

fracNumWords

public double fracNumWords()
The proportion of effective words contained in this sentence.

Returns:
A value in the [0,1] interval.

setMetric

public void setMetric(String smetric)
Defines which should be the default similarity function to be used in the sentence similarity computation.

Parameters:
smetric - Contains the name of the similarity function. The possible values are: ngram, xgram, bleu, edit, entropy, or sumo.
See Also:
The defined metric codes.

similarity

public double similarity(Sentence other)
Compute the similarity metric between two sentences. The "sumo metric" is the default similarity function used.

Parameters:
other - The other sentence.
Returns:
The similarity value [0,1], according to some specified metric.
See Also:
The defined metric codes.

similarity

public double similarity(Sentence other,
                         String metric)
Calculates the similarity between two sentences using a given similarity function.

Parameters:
other - The other sentence.
metric - The name of the similarity function.
Returns:
A value in the [0,1] interval.
Since:
2009-11-17

print

public void print(int a,
                  int b)
Outputs the words of this sentence, between two positions.

Parameters:
a - The first position.
b - The second position.

print

public void print()
Outputs the string representing this sentence.


println

public void println()
Outputs all words from this sentence, one word per line.


toLowerCase

public void toLowerCase()
Converts every word to lower case and transforms their CorpusIndex codes to -1. Thus, any lexical codification will be eliminated.


toLowerCase

public void toLowerCase(CorpusIndex dic)
Converts every word to lower case and redefines each word's lexical code, based on a supplied dictionary.

Parameters:
dic - The dictionary.

toString

public String toString()
The overriding of the toString() method.

Overrides:
toString in class AbstractCollection<Word>
Returns:
A string representing this sentence.

toStringPOS

public String toStringPOS()
A toString()-style method giving each word joined with its respective part-of-speech tag.

Returns:
The string of words joined with their POS tags.

toMWUString

public String toMWUString()
Transforms this sentence into a kind of multi-word-unit (MWU) expression. Each word is connected to its neighbors through underscores. For example, the sentence "The big cat" will give rise to "The_big_cat".

Returns:
The multi-word-unit.
Since:
2010-02-12 (Created for the work with Gintare).
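
The transformation in the example above amounts to joining the words with underscores. A minimal sketch, using a plain String array instead of a Sentence (the class MwuSketch and its method are hypothetical names):

```java
final class MwuSketch {

    /** Joins the given words into a single underscore-separated expression. */
    static String toMwu(String[] words) {
        return String.join("_", words);
    }

    public static void main(String[] args) {
        System.out.println(toMwu(new String[] {"The", "big", "cat"})); // prints The_big_cat
    }
}
```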

mutation

public Sentence mutation(int n)
Produces a given number of random "mutations" in this sentence. This method was used in several early paraphrase detection experiments. A "mutation" consists of replacing a word of the sentence with a constant mutation token ("XMUT").

Parameters:
n - The maximum and likely number of mutations.
Returns:
A mutated sentence.
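
The idea can be sketched as follows, assuming that positions are drawn at random with replacement, which would explain why n is "the maximum and likely" number of mutations: the same position may be drawn twice. This selection strategy, the class MutationSketch, and the explicit Random parameter are assumptions, not the library's documented behavior.

```java
import java.util.Random;

final class MutationSketch {

    static final String XMUT = "XMUT"; // the mutation token named in the description

    /** Returns a copy of words in which up to n randomly chosen positions are replaced by XMUT. */
    static String[] mutate(String[] words, int n, Random rnd) {
        String[] out = words.clone(); // the input array is left untouched
        if (out.length == 0) {
            return out;
        }
        for (int k = 0; k < n; k++) {
            out[rnd.nextInt(out.length)] = XMUT; // the same position may be picked again
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = "the big cat sat on the mat".split(" ");
        System.out.println(String.join(" ", mutate(words, 2, new Random())));
    }
}
```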

x201102012359

public static void x201102012359()

x201102281055

public static void x201102281055()
Corrections following the exhaustive tests carried out by Steven Burrows.


testaMetricas

public static void testaMetricas(String s1,
                                 String s2)

demoForWebPage

public static void demoForWebPage()

main

public static void main(String[] args)
The main method contains a general class tester.

Parameters:
args - The command-line arguments.