hultig.sumo
Class Word

java.lang.Object
  extended by hultig.sumo.Word
All Implemented Interfaces:
Serializable

public class Word
extends Object
implements Serializable

A class to represent and process a textual word.

University of Beira Interior (UBI)
Centre For Human Language Technology and Bioinformatics (HULTIG)

See Also:
Serialized Form

Field Summary
 ChunkTag CHTAG
           
 int[] cods
          Introduced later, in June 2008.
 long FREQ
           
static String RPUNCT
           
static long serialVersionUID
           
 
Constructor Summary
Word()
          Default constructor.
Word(String word)
          Create a new word from a given received String.
Word(String word, int syntcod)
          Create a Word and mark it with a syntactic code.
Word(String word, String meta_item)
          Create a new word from a given received String.
Word(String word, String[] meta)
          Create a word, labeling it with an array of multi-tags.
 
Method Summary
 char charAt(int k)
          Access a word character at a given position.
 double connectProb(Word w)
          Similar to costAlign but inverted and normalized in the [0, 1] interval.
 double costAlign(Word w)
          Cost of aligning two words.
 double distcos(Word w)
          Another lexical metric, based on the cosine.
 float distlex(String s)
          Calls "distlex(word.toString(), s, 2f)".
static float distlex(String sa, String sb)
          Calls "distlex(sa, sb, 2f)".
static float distlex(String sa, String sb, float q)
          Implements a metric that calculates the lexical distance between two words.
 float distlex(Word w)
          Calls "distlex(word.toString(), w.toString(), 2f)"
 float distlex(Word w, float q)
          Calls "distlex(word.toString(), w.toString(), q)"
static float distlexSuffix(String sa, String sb)
          Calls "distlexSuffix(sa, sb, 2f)".
static float distlexSuffix(String sa, String sb, float q)
          This method implements a similar metric as in "distlex".
static double distSeqMax(String sa, String sb)
          A normalized Edit Distance which normalizes by taking the maximum common sequence between the two sentences (Presented at ACL 2007).
 double dnormEditDistance(Word w)
          Computes a normalized Edit Distance of two words.
static int editDistance(String s, String t)
          Computes the Levenshtein distance, also known as the edit distance.
 int editDistance(Word w)
          Calls the method "editDistance(this.toString(), w.toString())"
 int editProximity(Word w)
          The Edit Distance complement.
 boolean equals(Word w)
          Equality test for two words, this and the other one.
 int getChkCod()
          Obtain the chunk code.
 int getLexCod()
          Obtain this word's lexical code.
 String getMetaValue(String metatag)
          Return a given meta-tag value associated with this word.
 String getPOS()
          Gives the POS tag of this word.
 String getPOS(int size)
          Get the first size characters of the POS label.
 String getPOS(POSType post)
           
 int getPosCod()
          Obtain the POS code.
 String getTag()
          Get the POS tag of this word, if any is defined.
 boolean hasPOS()
          Test whether this word is POS tagged or not.
 boolean isEmpty()
          Test whether this word is undefined or not.
 boolean isNumWord()
          Test if this is a number or a word.
static boolean isPunct(char c)
          Test if a given character is a punctuation mark.
 boolean isRPUNCT()
          Test whether this is a punctuation mark.
 boolean isWord()
          Test whether this is really a word and not, for example, a number, a punctuation mark, or any other token.
 int length()
          Gives the word length.
static void main(String[] args)
          The main method tests this class by executing several experiments for a predefined set of word pairs.
 void posLabel(POSType post)
           
 void set(String word)
          Redefines this word based on the received string, which is assumed to contain just the alpha sequence representing a single word.
 void set(String word, String meta_item)
          Redefines this word based on the received string, which is assumed to contain just the alpha sequence representing a single word.
 void set(String word, String[] meta)
           
 void setChkCod(int chkcod)
          Sets the chunk code of this word, meaning that this word is contained in a chunk (shallow parsing) with that code.
 void setLexCod(int lexcod)
          Defines the word lexical code.
 void setMetaTag(String metatag, String value)
           
 void setPOS(char[] v)
          Sets the POS tag of this word to a valid POS tag.
 void setPOS(String tag)
          Sets the POS tag of this word to a valid POS tag.
 void setPosCod(int poscod)
          Sets the POS tag code for this word.
 String toLowerCase()
          Convert all characters from this word to lower case.
 String toString()
          Override of the toString() method.
 String toString(boolean with_pos_tags)
          A specific toString method.
 String toStringPOS()
          A toString() type method giving the word string concatenated with its part-of-speech tag, if defined.
 String toStringPOS(POSType postype)
          Similar to the toStringPOS() method, except that the part-of-speech representation is passed by parameter.
static String words2StringPOS(Word[] words, POSType post)
          Transform an array of words into a single string, with each word concatenated with its POS tag.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values

cods

public int[] cods
Introduced later, in June 2008. The idea is to use several codes representing different kinds of tags: lexical, syntactic, and possibly others. So far the first three positions store, respectively, the lexical, POS, and chunker codes.
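
The layout described above can be pictured with a small sketch. The class name CodsLayoutSketch and the index constants LEX, POS and CHK are hypothetical names chosen for illustration; only the ordering (lexical, POS, chunker) comes from the description.

```java
// Illustrative only: the constant names are assumptions, not part of hultig.sumo.Word.
final class CodsLayoutSketch {
    static final int LEX = 0; // position of the lexical code
    static final int POS = 1; // position of the part-of-speech code
    static final int CHK = 2; // position of the chunker code

    // One slot per kind of code, in the order given in the field description.
    final int[] cods = new int[3];

    void setLexCod(int code) { cods[LEX] = code; }
    void setPosCod(int code) { cods[POS] = code; }
    void setChkCod(int code) { cods[CHK] = code; }

    int getLexCod() { return cods[LEX]; }
    int getPosCod() { return cods[POS]; }
    int getChkCod() { return cods[CHK]; }

    public static void main(String[] args) {
        CodsLayoutSketch w = new CodsLayoutSketch();
        w.setLexCod(101);
        w.setPosCod(7);
        w.setChkCod(3);
        System.out.println(w.getLexCod() + " " + w.getPosCod() + " " + w.getChkCod()); // prints 101 7 3
    }
}
```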


CHTAG

public ChunkTag CHTAG

RPUNCT

public static String RPUNCT

FREQ

public long FREQ
Constructor Detail

Word

public Word()
Default constructor.


Word

public Word(String word)
Create a new word from a given received String. It assumes that this string contains a sequence of characters representing just one word. This constructor invokes the set(String word) method.

Parameters:
word - The String containing the word.

Word

public Word(String word,
            String meta_item)
Create a new word from a given received String. It assumes that this string contains a sequence of characters representing just one word. This constructor invokes the set(String word) method. The created word is also labeled with a meta-tag.

Parameters:
word - The String containing the word.
meta_item - The meta-tag labeling the created word.

Word

public Word(String word,
            String[] meta)
Create a word, labeling it with an array of multi-tags.

Parameters:
word - The String containing the word.
meta - The array of multi-tags.

Word

public Word(String word,
            int syntcod)
Create a Word and mark it with a syntactic code.

Parameters:
word - The String containing the word.
syntcod - The syntactic code.
Method Detail

set

public final void set(String word)
Redefines this word based on the received string, which is assumed to contain just the alpha sequence representing a single word.

Parameters:
word - The received string.

set

public void set(String word,
                String meta_item)
Redefines this word based on the received string, which is assumed to contain just the alpha sequence representing a single word. The created word also receives a meta-tag.

Parameters:
word - The received string.
meta_item - The meta-tag associated with this word.

charAt

public char charAt(int k)
            throws IndexOutOfBoundsException
Access a word character at a given position.

Parameters:
k - The position to read.
Returns:
The character read.
Throws:
IndexOutOfBoundsException

setLexCod

public void setLexCod(int lexcod)
Defines the word lexical code.

Parameters:
lexcod - The code.

setPosCod

public void setPosCod(int poscod)
Sets the POS tag code for this word.

Parameters:
poscod - The POS code.

setChkCod

public void setChkCod(int chkcod)
Sets the chunk code of this word, meaning that this word is contained in a chunk (shallow parsing) with that code.

Parameters:
chkcod - The chunk code.

getLexCod

public int getLexCod()
Obtain this word's lexical code.

Returns:
The lexical code.

getPosCod

public int getPosCod()
Obtain the POS code.

Returns:
The code.

getChkCod

public int getChkCod()
Obtain the chunk code.

Returns:
The code.

set

public void set(String word,
                String[] meta)

setMetaTag

public void setMetaTag(String metatag,
                       String value)

hasPOS

public boolean hasPOS()
Test whether this word is POS tagged or not.

Returns:
boolean

getPOS

public String getPOS()
Gives the POS tag of this word.

Returns:
String The POS tag.

getPOS

public String getPOS(POSType post)

getPOS

public String getPOS(int size)
Get the first size characters of the POS label.

Parameters:
size - int
Returns:
String
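
As a rough illustration of the truncation described above, the sketch below takes the first size characters of a POS label. The class name PosPrefixSketch, the clamping of size to the label length, and the null handling are assumptions made for this example; the source does not say how getPOS(int) treats those cases.

```java
// A minimal sketch, not the library's implementation.
final class PosPrefixSketch {
    // Returns the first `size` characters of `pos`, or the whole label
    // when it is shorter than `size`. A null label yields null (assumption).
    static String prefix(String pos, int size) {
        if (pos == null) return null;
        int n = Math.max(0, Math.min(size, pos.length()));
        return pos.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(prefix("VBZ", 2)); // prints VB
        System.out.println(prefix("NN", 5));  // prints NN
    }
}
```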

setPOS

public void setPOS(char[] v)
Sets the POS tag of this word to a valid POS tag.

Parameters:
v - char[]

setPOS

public void setPOS(String tag)
Sets the POS tag of this word to a valid POS tag.

Parameters:
tag - String

toLowerCase

public String toLowerCase()
Convert all characters from this word to lower case.


equals

public boolean equals(Word w)
Equality test for two words, this and the other one.

Parameters:
w - The other word.
Returns:

toString

public String toString()
Override of the toString() method.

Overrides:
toString in class Object
Returns:
String The string of this word.

toString

public String toString(boolean with_pos_tags)
A specific toString method. If the parameter flag is true the toStringPOS() method is invoked.

Parameters:
with_pos_tags - The part-of-speech flag.
Returns:
The string representing this word.

posLabel

public void posLabel(POSType post)

toStringPOS

public String toStringPOS(POSType postype)
Similar to the toStringPOS() method, except that the part-of-speech representation is passed by parameter.

Parameters:
postype - The POS representation.
Returns:
Examples: "the/DT", "cat/NN", "is/VBZ", "flying/VBG".

toStringPOS

public String toStringPOS()
A toString() type method giving the word string concatenated with its part-of-speech tag, if defined.

Returns:
Examples: "the/DT", "cat/NN", "is/VBZ", "flying/VBG".

words2StringPOS

public static String words2StringPOS(Word[] words,
                                     POSType post)
Transform an array of words into a single string, with each word concatenated with its POS tag.

Parameters:
words - The array of words.
post - The POS representation.
Returns:
The concatenated word string.
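
The "word/TAG" format shown in the examples above can be sketched as follows. This is a standalone illustration: the class name PosStringSketch is hypothetical, tags are passed as plain strings rather than through POSType, and the single-space separator between words is an assumption, since the source does not state which separator words2StringPOS uses.

```java
// A minimal sketch of the "word/TAG" rendering; not the library's code.
final class PosStringSketch {
    // Appends "/TAG" only when a tag is defined.
    static String withTag(String word, String tag) {
        return (tag == null || tag.isEmpty()) ? word : word + "/" + tag;
    }

    // Joins the tagged words with a single space (assumed separator).
    static String join(String[] words, String[] tags) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(withTag(words[i], tags[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"the", "cat", "is", "flying"};
        String[] tags = {"DT", "NN", "VBZ", "VBG"};
        System.out.println(join(words, tags)); // prints the/DT cat/NN is/VBZ flying/VBG
    }
}
```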

getTag

public String getTag()
Get the POS tag of this word, if any is defined.

Returns:
The POS tag or null.

getMetaValue

public String getMetaValue(String metatag)
Return a given meta-tag value associated with this word. Meta-tags are stored in the "META" list, where each element is a pair of the form "type=value". For example: "polarity=positive".

Parameters:
metatag - The meta-tag (ex: "polarity")
Returns:
The value for that meta-tag (ex: "positive").
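
A lookup over "type=value" pairs, as described above, might look like the sketch below. The class MetaTagSketch, its list field, the replace-on-duplicate behaviour of setMetaTag, and the null result for a missing tag are assumptions for illustration; only the "type=value" storage format comes from the description.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of meta-tag storage; not the library's implementation.
final class MetaTagSketch {
    // Each element has the form "type=value", e.g. "polarity=positive".
    private final List<String> meta = new ArrayList<>();

    // Stores the pair, replacing an existing entry with the same type (assumption).
    void setMetaTag(String metatag, String value) {
        String prefix = metatag + "=";
        for (int i = 0; i < meta.size(); i++) {
            if (meta.get(i).startsWith(prefix)) {
                meta.set(i, prefix + value);
                return;
            }
        }
        meta.add(prefix + value);
    }

    // Returns the value stored for the given type, or null when absent (assumption).
    String getMetaValue(String metatag) {
        String prefix = metatag + "=";
        for (String item : meta) {
            if (item.startsWith(prefix)) return item.substring(prefix.length());
        }
        return null;
    }

    public static void main(String[] args) {
        MetaTagSketch w = new MetaTagSketch();
        w.setMetaTag("polarity", "positive");
        System.out.println(w.getMetaValue("polarity")); // prints positive
    }
}
```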

isEmpty

public boolean isEmpty()
Test whether this word is undefined or not.

Returns:
The boolean test result.

isPunct

public static boolean isPunct(char c)
Test if a given character is a punctuation mark.

Parameters:
c - The character to be tested.
Returns:
The boolean test result.

isWord

public boolean isWord()
Test whether this is really a word and not, for example, a number, a punctuation mark, or any other token.

Returns:
The boolean test result.

isNumWord

public boolean isNumWord()
Test if this is a number or a word. That is, we have either a sequence of letters or a sequence of digits.

Returns:
The boolean test result.
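
One way to read the test above is "all letters, or all digits". The sketch below follows that reading; the class name NumWordSketch, the use of Character.isLetter and Character.isDigit, and the false result for an empty string are assumptions, not details confirmed by the source.

```java
// A minimal sketch of a "number or word" test; not the library's code.
final class NumWordSketch {
    // True when the string is non-empty and consists only of letters
    // or only of digits; mixed tokens such as "a1" are rejected.
    static boolean isNumWord(String s) {
        if (s == null || s.isEmpty()) return false;
        boolean allLetters = true;
        boolean allDigits = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) allLetters = false;
            if (!Character.isDigit(c)) allDigits = false;
        }
        return allLetters || allDigits;
    }

    public static void main(String[] args) {
        System.out.println(isNumWord("cat"));  // prints true
        System.out.println(isNumWord("2008")); // prints true
        System.out.println(isNumWord("a1"));   // prints false
    }
}
```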

isRPUNCT

public boolean isRPUNCT()
Test whether this is a punctuation mark.

Returns:
The boolean test result.

length

public int length()
Gives the word length.

Returns:
The length.

distlex

public static float distlex(String sa,
                            String sb,
                            float q)
Implements a metric that calculates the lexical distance between two words. It is based on the idea of the prefix significance, i.e. the farther we are from the words starting positions the less significant will be the differences.

Parameters:
sa - One word string.
sb - The other word string.
q - A formula parameter.
Returns:
The calculated distance.
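
The source does not give the distlex formula, so the following is only a hypothetical sketch of the prefix-significance idea: a mismatch at position i is weighted by q^(-i), so differences near the start of the words count more than differences near the end. The class name PrefixDistanceSketch, the method name prefixWeighted, the weighting scheme, and the normalization to [0, 1] are all assumptions made for illustration.

```java
// Hypothetical prefix-weighted distance; NOT the formula used by Word.distlex.
final class PrefixDistanceSketch {
    // Compares the strings position by position. Position i carries the
    // weight q^(-i); a position counts as a difference when the characters
    // differ or when one string has already ended. The weighted sum of
    // differences is divided by the total weight, giving a value in [0, 1].
    static float prefixWeighted(String sa, String sb, float q) {
        int n = Math.max(sa.length(), sb.length());
        if (n == 0) return 0f;
        float weight = 1f;
        float diff = 0f;
        float total = 0f;
        for (int i = 0; i < n; i++) {
            boolean same = i < sa.length() && i < sb.length()
                    && sa.charAt(i) == sb.charAt(i);
            if (!same) diff += weight;
            total += weight;
            weight /= q;
        }
        return diff / total;
    }

    public static void main(String[] args) {
        // A first-letter difference weighs more than a last-letter one.
        System.out.println(prefixWeighted("bat", "cat", 2f));
        System.out.println(prefixWeighted("cat", "car", 2f));
    }
}
```

With q = 2f, as in the two-argument overloads, each position in this sketch weighs half as much as the one before it.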

distlexSuffix

public static float distlexSuffix(String sa,
                                  String sb,
                                  float q)
This method implements a metric similar to "distlex". The main difference is that the most significant character here is the word's last one.

Parameters:
sa - One word string.
sb - The other word string.
q - A formula parameter
Returns:
The calculated distance.

distlexSuffix

public static float distlexSuffix(String sa,
                                  String sb)
Calls "distlexSuffix(sa, sb, 2f)".

Parameters:
sa - One word string.
sb - The other word string.
Returns:
The calculated distance.

distlex

public static float distlex(String sa,
                            String sb)
Calls "distlex(sa, sb, 2f)".

Parameters:
sa - One word string.
sb - The other word string.
Returns:
The calculated distance.

distlex

public float distlex(String s)
Calls "distlex(word.toString(), s, 2f)".

Parameters:
s - The other word string.
Returns:
The calculated distance.

distlex

public float distlex(Word w,
                     float q)
Calls "distlex(word.toString(), w.toString(), q)"

Parameters:
w - The other word.
q - A formula parameter
Returns:
The calculated distance.

distlex

public float distlex(Word w)
Calls "distlex(word.toString(), w.toString(), 2f)"

Parameters:
w - The other word.
Returns:
The calculated distance.

distcos

public double distcos(Word w)
Another lexical metric, based on the cosine. Each word is transformed into a vector.

Parameters:
w - The other word.
Returns:
The calculated distance.
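
The source does not say how a word is turned into a vector. The sketch below assumes one possible choice, character-count vectors, and returns one minus the cosine of the angle between them. The class name CosineSketch, the vectorization, and the value returned when either string is empty are assumptions; distcos may differ on all three.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal cosine-based sketch over character counts; not the library's code.
final class CosineSketch {
    // Builds the character-count vector of a string.
    static Map<Character, Integer> counts(String s) {
        Map<Character, Integer> v = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            v.merge(s.charAt(i), 1, Integer::sum);
        }
        return v;
    }

    // Returns 1 - cos(a, b); 0 means identical character counts.
    static double cosineDistance(String a, String b) {
        Map<Character, Integer> va = counts(a);
        Map<Character, Integer> vb = counts(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Character, Integer> e : va.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer other = vb.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int c : vb.values()) nb += c * c;
        if (na == 0 || nb == 0) return 1.0; // empty input: treated as maximally distant (assumption)
        return 1.0 - dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        System.out.println(cosineDistance("abc", "xyz")); // prints 1.0
        System.out.println(cosineDistance("flying", "flyer"));
    }
}
```

Note that character counts ignore order, so anagrams come out at distance 0 in this sketch.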

dnormEditDistance

public double dnormEditDistance(Word w)
Computes a normalized Edit Distance of two words. The Edit Distance value is divided by the maximum length of the two words.

Parameters:
w - The other word to compare to.
Returns:
The normalized (in [0,1]) distance.

distSeqMax

public static double distSeqMax(String sa,
                                String sb)
A normalized Edit Distance which normalizes by taking the maximum common sequence between the two sentences (Presented at ACL 2007).

Parameters:
sa - One string.
sb - The other string.
Returns:
double

editProximity

public int editProximity(Word w)
The Edit Distance complement. This is calculated as follows:
   size(max(wa,wb)) - editDistance(wa, wb)

Parameters:
w - The other word to compare to.
Returns:
The calculated value.

editDistance

public int editDistance(Word w)
Calls the method "editDistance(this.toString(), w.toString())"

Parameters:
w - The other word.
Returns:
The calculated value.

editDistance

public static int editDistance(String s,
                               String t)
Computes the Levenshtein distance, also known as the edit distance.

Parameters:
s - One string.
t - The other string.
Returns:
The calculated value.
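
For reference, the sketch below shows a standard dynamic-programming Levenshtein distance together with the two derived quantities described under dnormEditDistance and editProximity (division by the longer length, and the longer length minus the distance). The class EditDistanceSketch and its method names are hypothetical, the code operates on plain strings rather than Word objects, and it is not taken from the library's source.

```java
// A minimal sketch of the edit-distance family; not the library's implementation.
final class EditDistanceSketch {
    // Levenshtein distance with unit costs, keeping only two rows of the table.
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Edit distance divided by the longer length, giving a value in [0, 1].
    static double normalized(String s, String t) {
        int max = Math.max(s.length(), t.length());
        return max == 0 ? 0.0 : (double) editDistance(s, t) / max;
    }

    // Longer length minus the edit distance.
    static int proximity(String s, String t) {
        return Math.max(s.length(), t.length()) - editDistance(s, t);
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // prints 3
        System.out.println(proximity("kitten", "sitting"));    // prints 4
        System.out.println(normalized("kitten", "sitting"));
    }
}
```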

costAlign

public double costAlign(Word w)
Cost of aligning two words. This formula was used to compute a "Word Mutation Matrix", analogous to the gene mutation matrices used in biology. (Published at ACL 2007, by JPC).

   costAlign: Word × Word → [0, +∞[

Note that this function is a cost: the greater the value, the less likely the alignment.

Parameters:
w - Word
Returns:
double

connectProb

public double connectProb(Word w)
Similar to costAlign but inverted and normalized in the [0, 1] interval. This function expresses the lexical connectivity between two words.

Parameters:
w - Word
Returns:
double
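
The source does not state how the cost is inverted and normalized. Purely as an illustration of what such a mapping can look like, the sketch below sends a cost in [0, +∞[ to a score in ]0, 1] using 1 / (1 + cost). The class name ConnectProbSketch, the method name fromCost, and the choice of this particular function are assumptions; connectProb may use a different transformation.

```java
// Hypothetical cost-to-score mapping; NOT the formula used by Word.connectProb.
final class ConnectProbSketch {
    // A zero cost maps to 1.0, and the score decreases towards 0
    // as the cost grows. Negative costs are rejected.
    static double fromCost(double cost) {
        if (cost < 0) throw new IllegalArgumentException("cost must be >= 0");
        return 1.0 / (1.0 + cost);
    }

    public static void main(String[] args) {
        System.out.println(fromCost(0.0)); // prints 1.0
        System.out.println(fromCost(1.0)); // prints 0.5
    }
}
```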

main

public static void main(String[] args)
The main method tests this class by executing several experiments for a predefined set of word pairs.

Parameters:
args - String[]