The HultigLib
"Nuggets" for Text Processing in Java
Version 1.1
The HultigLib is a library of text processing tools written in Java. It was designed for efficiency and scalability with respect to the volume of text handled. Large collections of texts can be processed easily and used in a variety of applications, for example to compute the main statistics of a corpus.
The three core classes of this library are Word, Sentence, and Text. Their names indicate what they represent. A Sentence is represented as a linked list of Word elements and, similarly, a Text is represented as a linked list of Sentence elements.
A number of sentence similarity functions can be found in the Sentence class. These functions have been used in several research works, such as paraphrase identification in corpora and plagiarism detection. The Text class also provides several handy tools for representing and processing texts. We have also integrated the "openNLP" library and implemented an "interface" class for using its most relevant features, such as part-of-speech tagging and shallow and full sentence parsing.
The source code is available under the terms of the GNU General Public License (GPL) and can be obtained from the next section of this document. The library's Java documentation is available [here].
Several packages are available for download. They are protected by a user name and password. To get access to the downloads, please fill in the registration form with your name and e-mail address. The download keys will be sent to your e-mail address. This is only to keep track of the number of users interested in our resources. Thank you.
OpenNLPKit class, one just has to type:
$java -Xms32m -Xmx512m -cp hultiglib.jar hultig.sumo.OpenNLPKit
in the command line.
CLASSPATH environment variable. The HultigLib library depends on a number of third-party libraries, listed below, which can also be downloaded independently:
This section contains several examples of using the library. We start with the most common situations, and new examples will be added incrementally. Most of the classes can be tested independently by executing their main method. For example, the default test of the Word class can be run through its main method, as follows:
$java hultig.sumo.Word
and a subset of the printed output is shown below. This assumes that CLASSPATH is correctly defined, "pointing" to the necessary libraries (see the "Downloads" section).
---------------------------------------------------------------------------------------------------
WORD A           WORD B             edit  dlex       maxseq             (1)          (2)  (3)
---------------------------------------------------------------------------------------------------
correr           correndo      --->    3  0.0546875  0.6250000    0.2583661    4.7244094  0.4427847
reutilizar       utilizar      --->    2  1.9355469  0.8000000    4.7791281    2.4691358  0.6447420
arroz            atroz         --->    1  0.5000000  0.6000000    0.8196721    1.6393443  0.7490443
arroz            arrozal       --->    2  0.0468750  0.7142857    0.1294379    2.7613412  0.6116181
informatica      informatizado --->    3  0.0026855  0.6923077    0.0114717    4.2716320  0.4750068
the              thin          --->    2  0.3750000  0.5000000    1.4705882    3.9215686  0.5006360
in               include       --->    5  0.4843750  0.2857143    8.1899155   16.9082126  0.1435395
in               by            --->    2  1.5000000  0.0000000  300.0000000  200.0000000  0.0012732
of               by            --->    2  1.5000000  0.0000000  300.0000000  200.0000000  0.0012732
governor         governed      --->    2  1.5000000  0.7500000    0.0616776    2.6315789  0.6260573
pay              paying        --->    3  1.9687500  0.5000000    1.2867647    5.8823529  0.3749214
hamburger        spiritual     --->    9  1.9960938  0.1111111  148.3335722   74.3119266  0.0316947
reinterpretation interpreted   --->    7  1.9999695  0.5625000   24.2388463   12.2270742  0.1983173
---------------------------------------------------------------------------------------------------
LEGEND:
  (1) - edit*dlex/maxseq
  (2) - edit/maxseq
  (3) - connectProb(Wa,Wb): Connection probability
The output shows the lexical similarity values calculated for a set of word pairs, using different similarity functions defined in Word. For example, the edit column contains the results of the Edit Distance function, while the last column contains the results of the function we proposed in [3]. Note that the value shown here is the connection likelihood, whereas the article uses the cosAlign(Wa,Wb) function, which is its complement: cosAlign(Wa,Wb) = 1.0 - connectProb(Wa,Wb).
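To make the edit column and the complement relation concrete, here is a minimal, self-contained sketch. It is not the HultigLib implementation: EditDistanceSketch and its methods are hypothetical names, and the distance is the textbook Levenshtein algorithm, which is assumed (not confirmed by the source) to be what the Edit Distance function computes. It does not attempt to reproduce dlex, maxseq, or connectProb.

```java
public class EditDistanceSketch {

    /**
     * Textbook Levenshtein distance: the minimum number of single-character
     * insertions, deletions and substitutions needed to turn a into b.
     * Hypothetical helper, not part of HultigLib.
     */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Complement relation quoted in the text: cosAlign = 1.0 - connectProb. */
    public static double cosAlign(double connectProb) {
        return 1.0 - connectProb;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("arroz", "atroz"));     // 1
        System.out.println(editDistance("correr", "correndo")); // 3
        System.out.println(editDistance("in", "include"));      // 5
        // connectProb value taken from the arroz/atroz row of the output
        System.out.printf("%.7f%n", cosAlign(0.7490443));
    }
}
```

The three distances agree with the edit column of the printed output for the same word pairs.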
The "Sentence" Example
The Sentence class represents a textual sentence as a linked list of words (an extension of LinkedList<Word>). One of the main features implemented in Sentence is the set of sentence similarity (or dissimilarity) methods. These methods have been thoroughly tested and used in a number of research works, among which we highlight [2,4,5,6,7]. The work in [4] is specifically devoted to a comparative study of these and other methods for paraphrase identification in corpora. Below is a small example using three sentences and two similarity methods. Other methods can be found in the corresponding [technical documentation].
Assuming that we have the following three sentences:
S1
)
S2
)
S3
)
next are the similarity values calculated between these sentences, respectively through the Sumo and Levenshtein methods:
[SENTENCE SIMILARITY - METHOD: Sumo]
            S1         S2         S3
S1   0,0000000  0,9074841  0,0000095
S2   0,9074841  0,0000000  0,0000159
S3   0,0000095  0,0000159  0,0000000

[SENTENCE DISSIMILARITY - METHOD: Edit Distance]
     S1  S2  S3
S1    0  15  18
S2   15   0  14
S3   18  14   0
This is the default output obtained by executing the hultig.sumo.Sentence class, and the corresponding code segment can also be seen [here].
The "Text" Example
The Text class represents a unit of text, which can range from a single paragraph to much larger amounts. Internally, a text is represented as a list of sentences (LinkedList<Sentence>). Text strings can be added dynamically to a Text object, even if they contain several sentences. In such cases, sentence boundary detection is applied to the multi-sentence input string. For example, given the following segment:
String s = "Pierre Vinken, 61 years old, will join the board as a "
         + "nonexecutive director Nov. 29. Mr. Vinken is chairman of "
         + "Elsevier N.V., the Dutch publishing group. Rudolph Agnew, "
         + "55 years old and former chairman of Consolidated Gold "
         + "Fields PLC, was named a director of this British "
         + "industrial conglomerate.";

Text txt = new Text(s);
for (int k = 0; k < txt.size(); k++) {
    Sentence stk = txt.getSentence(k);
    System.out.println(stk);
}
the output would be:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields ... British industrial conglomerate.
showing that three sentences were correctly identified. We are using Scott Piao's sentence breaker. A larger demo is available by executing the main method ($java hultig.sumo.Text), including our newly proposed text similarity function, which is inspired by the vector space model but is more efficient.
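For readers who want to experiment with sentence boundary detection without the library, the sketch below uses the JDK's java.text.BreakIterator. This is a different technique from Scott Piao's sentence breaker, which HultigLib uses; SentenceSplitSketch and split are hypothetical names. The JDK iterator is rule-based, so abbreviations such as those in the example above ("Nov.", "Mr.", "N.V.") may not be handled in the same way as by the library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {

    /** Splits text into sentences using the JDK's rule-based iterator. */
    public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the end offset of the current sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("It rained all day. We stayed home. Was that wise?")) {
            System.out.println(s);
        }
    }
}
```

Run on the simple input in main, this prints one sentence per line. Comparing its result on abbreviation-heavy text with the output of Text is a quick way to see what a dedicated sentence breaker adds.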
The "CorpusIndex" Example
The CorpusIndex class exists for efficiency reasons, especially when huge amounts of text are at stake. For a given text collection, a CorpusIndex object associates a unique numeric index with each word. As a result, word comparisons are performed quickly, through their indexes. For large collections of text, the indexer should be computed before any further text processing. An indexer can be saved to and loaded from a file, so we can load a previously computed indexer and then codify the text we are going to work with. The following code segment illustrates the dynamic creation of a small indexer:
String s1= "Radiation from this solar flare will hit Earth's magnetic field on Wednesday";
String s2= "Our magnetic field will be affected, next Wednesday, by this solar flare.";
String s3= "Tim Cook and Philip Schiller unveil the company's newest iPad.";

CorpusIndex dict= new CorpusIndex(); // Creates the indexer.
dict.add(s1);                        // Adds one string.
dict.add(s2);                        // Adds another string.
Text t= new Text(s2);                // Creates a new text, from string s2.
t.codify(dict);                      // Codifies the created text.

dict.add(s3);                        // Adds a new string to the indexer.
dict.rebuild();                      // Rebuilds the indexer - words will get a
                                     // new code.
t.add(s3);                           // Adds a new string to the text.
t.codify(dict);                      // Recodifies the text, now with two sentences,
                                     // with the rebuilt indexer.
In a dynamic indexer construction, as previously illustrated, the rebuild() method must be called to update the word indexes, since new words may have been added.
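The following sketch illustrates why codes can change after a rebuild and why a text must then be codified again. It is not the CorpusIndex implementation: TinyIndexSketch is a hypothetical class, and assigning codes in alphabetical order is only one plausible design, assumed here for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class TinyIndexSketch {
    // Sorted vocabulary; codes are only (re)assigned by rebuild().
    private final TreeSet<String> vocabulary = new TreeSet<>();
    private final Map<String, Integer> codes = new HashMap<>();

    /** Adds the words of a string to the vocabulary (naive tokenization). */
    public void add(String text) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                vocabulary.add(w);
            }
        }
    }

    /** Reassigns codes 0..n-1 in alphabetical order (assumed policy). */
    public void rebuild() {
        codes.clear();
        int next = 0;
        for (String w : vocabulary) {
            codes.put(w, next++);
        }
    }

    /** Returns the code of a word, or -1 if it has no code yet. */
    public int code(String word) {
        return codes.getOrDefault(word.toLowerCase(), -1);
    }

    public static void main(String[] args) {
        TinyIndexSketch idx = new TinyIndexSketch();
        idx.add("solar flare");
        idx.rebuild();
        System.out.println(idx.code("solar")); // 1 (flare=0, solar=1)

        idx.add("magnetic field");
        idx.rebuild();
        System.out.println(idx.code("solar")); // 3 (field, flare, magnetic, solar)
    }
}
```

Because the code of "solar" moves from 1 to 3 once new words are indexed, any text codified before the rebuild would hold stale codes. Under this assumed design, that is the reason a second codify call is needed after rebuild().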
This section lists the known publications of research work that uses this library. The list is ordered by publication date and numbered sequentially. These numbers are the citation keys used throughout this document.