The HultigLib Project

Description

HultigLib is a library of text processing tools written in Java. It was designed for efficiency and scalability with respect to the volume of text handled. Large text collections can be processed and used in a variety of applications, for example to compute the main statistics of a corpus.

The three core classes of the library are Word, Sentence, and Text, and their names indicate what they represent. A Sentence is represented as a linked list of Word elements and, similarly, a Text is represented as a linked list of Sentence elements. The Sentence class provides a number of sentence similarity functions, which have been used in several research works, such as paraphrase identification in corpora and plagiarism detection. The Text class also provides several handy tools for representing and processing texts. We have also integrated the OpenNLP library and implemented an "interface" class that exposes its most relevant features, such as part-of-speech tagging and shallow and full sentence parsing.

The source code is available under the terms of the GNU General Public License (GPL) and can be obtained from the Downloads section of this document. The library's Java documentation is available [here].

Downloads

Several packages are available for download. They are protected by a user name and password. To gain access to the downloads, please fill in the registration form with your name and e-mail address. The download keys will be sent to your e-mail. This is only to keep track of the number of users interested in our resources. Thank you.

Huge (89.4MB) [download] — Everything needed is contained in a single jar file, including all the necessary third-party libraries and language models. This version includes the OpenNLP language models for English and Portuguese, so to test the OpenNLPKit class, for example, one just has to type:
```
   $ java -Xms32m -Xmx512m -cp hultiglib.jar hultig.sumo.OpenNLPKit
```
on the command line.
Large (7MB) [download] — One single jar file containing all the necessary third party libraries, without any language models.
Small (357KB) [download] — Contains only the project classes. To test and use the library's functionality, all third-party libraries must be obtained separately and their locations added to the CLASSPATH environment variable. HultigLib depends on the third-party libraries listed below, which can be downloaded independently:
- OpenNLP 1.5 (opennlp-tools-1.5.0.jar) — The OpenNLP project.
- Trove 2.0.1 (trove-2.0.1.jar) — High speed regular and primitive collections for Java. Used by OpenNLP.
- Max Entropy 3.0.0 (maxent-3.0.0.jar) — The OpenNLP Maximum Entropy Package.
- ICU for Java, 4.0 (icu4j.jar) — International Components for Unicode.
- Scott Piao's sentence breaker (spiaotools.jar)
Source (164KB) [download] — A zip file with all the source code. Note that the code is available under the terms of the GNU General Public License (GPL).

Examples

This section contains several examples of using the library. We start with the most common situations, and new examples will be added incrementally. Most of the classes can be tested independently by executing their main method. For example, the default test of the Word class can be run through its main method, as follows:

             $ java hultig.sumo.Word

and a subset of the printed output is shown below. This assumes that CLASSPATH is correctly defined, "pointing" to the necessary libraries (see the "Downloads" section).

       --------------------------------------------------------------------------------------------------------------------
       |     WORD A      |     WORD B     |           edit    dlex       maxseq         (1)            (2)           (3)
       --------------------------------------------------------------------------------------------------------------------
                 correr          correndo      --->     3   0.0546875  0.6250000      0.2583661      4.7244094    0.4427847
             reutilizar          utilizar      --->     2   1.9355469  0.8000000      4.7791281      2.4691358    0.6447420
                  arroz             atroz      --->     1   0.5000000  0.6000000      0.8196721      1.6393443    0.7490443
                  arroz           arrozal      --->     2   0.0468750  0.7142857      0.1294379      2.7613412    0.6116181
            informatica     informatizado      --->     3   0.0026855  0.6923077      0.0114717      4.2716320    0.4750068
                    the              thin      --->     2   0.3750000  0.5000000      1.4705882      3.9215686    0.5006360
                     in           include      --->     5   0.4843750  0.2857143      8.1899155     16.9082126    0.1435395
                     in                by      --->     2   1.5000000  0.0000000    300.0000000    200.0000000    0.0012732
                     of                by      --->     2   1.5000000  0.0000000    300.0000000    200.0000000    0.0012732
               governor          governed      --->     2   1.5000000  0.7500000      0.0616776      2.6315789    0.6260573
                    pay            paying      --->     3   1.9687500  0.5000000      1.2867647      5.8823529    0.3749214
              hamburger         spiritual      --->     9   1.9960938  0.1111111    148.3335722     74.3119266    0.0316947
       reinterpretation       interpreted      --->     7   1.9999695  0.5625000     24.2388463     12.2270742    0.1983173
       --------------------------------------------------------------------------------------------------------------------
       LEGEND:
           (1) - edit*dlex/maxseq
           (2) - edit/maxseq
           (3) - connectProb(Wa,Wb): Connection probability

The output shows the lexical similarity values calculated for a set of word pairs, using different similarity functions defined in Word. For example, the edit column contains the results of the edit distance function, while the last column contains the results of the function we proposed in [3]. Strictly speaking, the last column shows the connection likelihood, whereas the article uses the cosAlign(Wa,Wb) function, which is its complement: cosAlign(Wa,Wb) = 1.0 - connectProb(Wa,Wb).
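The edit and maxseq columns can be reproduced outside the library. The sketch below is a minimal stand-alone illustration, not the code of the Word class. It assumes that edit is the standard Levenshtein distance and that maxseq is the length of the longest common contiguous substring divided by the length of the longer word. Both assumptions are inferred from the printed values (for example, arroz/atroz gives 1 and 0.6), and the class and method names are hypothetical. The dlex and connectProb columns are not covered.

```java
import java.util.Locale;

// Hypothetical stand-alone sketch; this is NOT the HultigLib Word class.
final class WordSimilaritySketch {

    /** Standard Levenshtein distance between two words (assumed meaning of "edit"). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Longest common contiguous substring length over the longer word length (assumed meaning of "maxseq"). */
    static double maxSeq(String a, String b) {
        int best = 0;
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    best = Math.max(best, len[i][j]);
                }
            }
        }
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) best / longer;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"correr", "correndo"}, {"arroz", "atroz"}, {"in", "by"}};
        for (String[] p : pairs) {
            System.out.println(String.format(Locale.ROOT, "%12s %12s  edit=%d  maxseq=%.7f",
                    p[0], p[1], editDistance(p[0], p[1]), maxSeq(p[0], p[1])));
        }
    }
}
```

Dividing the first value by the second gives a rough idea of how the combined columns behave: a small edit distance together with a long shared substring yields a small ratio.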

The "Sentence" Example
The Sentence class represents a textual sentence as a linked list of words (an extension of LinkedList<Word>). One of the main features implemented in Sentence is its set of sentence similarity (or dissimilarity) methods. These methods have been thoroughly tested and used in a number of research works [2,4,5,6,7]. The work in [4] is specifically dedicated to a comparative study of these and other methods for paraphrase identification in corpora. Below is a small example using three sentences and two similarity methods. Other methods can be found in the corresponding [technical documentation]. Assuming that we have the following three sentences:

Radiation from this solar flare will hit Earth's magnetic field on Wednesday, with impact on air traffic. (S1)

Our magnetic field will be affected, next Wednesday, by this solar flare. (S2)

Tim Cook and Philip Schiller unveil the company's newest iPad. (S3)

the similarity values calculated between them with the Sumo and Levenshtein (edit distance) methods, respectively, are:

        [SENTENCE SIMILARITY - METHOD: Sumo]
                S1          S2          S3
           S1   0,0000000   0,9074841   0,0000095
           S2   0,9074841   0,0000000   0,0000159
           S3   0,0000095   0,0000159   0,0000000

        [SENTENCE DISSIMILARITY - METHOD: Edit Distance]
                  S1     S2     S3
           S1      0     15     18
           S2     15      0     14
           S3     18     14      0

This is the default output produced by executing the hultig.sumo.Sentence class; the corresponding code segment can be seen [here].
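The edit distance values in the second matrix are consistent with a Levenshtein distance computed over tokens instead of characters. The following is a minimal stand-alone sketch of that idea, not the code of the Sentence class. It assumes a simple tokenizer in which runs of letters, digits and apostrophes form one token and every other non-space character is a token of its own; the library's tokenization may differ. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; this is NOT the HultigLib Sentence class.
final class SentenceEditDistanceSketch {

    // Assumed tokenization: words keep their apostrophes, punctuation marks are separate tokens.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}']+|[^\\s\\p{L}\\p{N}']");

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Levenshtein distance where the unit of comparison is a whole token. */
    static int editDistance(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int substitution = prev[j - 1] + (a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(curr[j - 1] + 1, prev[j] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    public static void main(String[] args) {
        String[] s = {
            "Radiation from this solar flare will hit Earth's magnetic field on Wednesday, with impact on air traffic.",
            "Our magnetic field will be affected, next Wednesday, by this solar flare.",
            "Tim Cook and Philip Schiller unveil the company's newest iPad."
        };
        // Print the pairwise dissimilarity matrix, one row per sentence.
        for (String row : s) {
            StringBuilder line = new StringBuilder();
            for (String col : s) {
                line.append(String.format("%7d", editDistance(tokenize(row), tokenize(col))));
            }
            System.out.println(line);
        }
    }
}
```

Under the tokenization assumed here, the three sentences above yield the same pairwise distances as the matrix printed by the library (15, 18 and 14).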

The "Text" Example
The Text class represents a unit of text, which can range from a single paragraph to much larger amounts of text. Internally, a text is represented as a list of sentences (LinkedList<Sentence>). Strings can be added dynamically to a Text object, even when they contain several sentences. In such cases, sentence boundary detection is applied to the multi-sentence input string. For example, given the following segment:

        String s =
                "Pierre Vinken, 61 years old, will join the board as a "
                + "nonexecutive director Nov. 29. Mr. Vinken is chairman of "
                + "Elsevier N.V., the Dutch publishing group. Rudolph Agnew, "
                + "55 years old and former chairman of Consolidated Gold "
                + "Fields PLC, was named a director of this British "
                + "industrial conglomerate."
                ;
        Text txt = new Text(s);
        for (int k = 0; k < txt.size(); k++) {
            Sentence stk = txt.getSentence(k);
            System.out.println(stk);
        }

the output would be:

        Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
        Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
        Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields ... British industrial conglomerate.

showing that three sentences were correctly identified. We use Scott Piao's sentence breaker. A larger demo is produced by executing the main method ($ java hultig.sumo.Text), including our newly proposed text similarity function, which is inspired by the vector space model but is more efficient.
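For readers unfamiliar with the vector space model mentioned above, the sketch below shows the standard baseline it refers to: each text becomes a term-frequency vector and two texts are compared by the cosine of the angle between their vectors. This is the textbook technique, not the library's proposed function, whose details are not given here. The class name, method names and the simple lower-casing tokenizer are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch of the classic vector space baseline;
// it does NOT reproduce the text similarity function implemented in HultigLib.
final class CosineSimilaritySketch {

    /** Counts how often each lower-cased word occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two term-frequency vectors, in [0, 1]. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        String t1 = "Our magnetic field will be affected, next Wednesday, by this solar flare.";
        String t2 = "Radiation from this solar flare will hit Earth's magnetic field on Wednesday.";
        System.out.println(cosine(termFrequencies(t1), termFrequencies(t2)));
    }
}
```

The baseline visits every term of one text for each comparison, which gives a sense of the cost that a more efficient similarity function would aim to reduce.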

The "CorpusIndex" Example
The CorpusIndex class exists for efficiency reasons, especially when huge amounts of text are at stake. For a given text collection, a CorpusIndex object associates a unique numeric index with each word. As a result, word comparisons are performed quickly through their indexes. For large text collections, the indexer should be computed before any further text processing. An indexer can be saved to and loaded from a file, so we can load a previously computed indexer and then codify the text we are going to work with. The following code segment illustrates the dynamic creation of a small indexer:

        String s1= "Radiation from this solar flare will hit Earth's magnetic field on Wednesday";
        String s2= "Our magnetic field will be affected, next Wednesday, by this solar flare.";
        String s3= "Tim Cook and Philip Schiller unveil the company's newest iPad.";

        CorpusIndex dict= new CorpusIndex(); //==> Creates the indexer
        dict.add(s1); //==> Adds one string.
        dict.add(s2); //==> Adds another string.

        Text t= new Text(s2); //==> Creates a new text, from string s2.
        t.codify(dict); //==> Codifies the created text.

        dict.add(s3); //====> Adds a new string to the indexer.
        dict.rebuild(); //==> Rebuilds the indexer - words will get a
                          //> new code.
        t.add(s3); //=======> Adds a new string to the text.
        t.codify(dict); //==> Recodifies the text, now with two sentences,
                          //> with the rebuilt indexer.

When an indexer is built dynamically, as illustrated above, the rebuild() method must be called to update the word indexes, since new words may have been added.
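To see why codes can change after a rebuild, consider the following minimal stand-alone sketch of a word indexer. It is not the CorpusIndex implementation. It assumes, purely for illustration, that codes are assigned in alphabetical order of the vocabulary, so adding a word can shift the codes of existing words. The class name, the whitespace-and-punctuation tokenizer and the use of -1 for unknown words are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; this is NOT the HultigLib CorpusIndex class.
final class WordIndexSketch {

    private final TreeSet<String> vocabulary = new TreeSet<>();  // sorted set of known words
    private final Map<String, Integer> codes = new HashMap<>();  // word -> numeric code

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Adds the words of a string to the vocabulary; codes stay stale until rebuild(). */
    void add(String text) {
        vocabulary.addAll(tokenize(text));
    }

    /** Reassigns codes in alphabetical order (an assumption made for this sketch). */
    void rebuild() {
        codes.clear();
        int next = 0;
        for (String word : vocabulary) {
            codes.put(word, next++);
        }
    }

    /** Maps each word of the text to its code, or -1 if the word has no code yet. */
    int[] codify(String text) {
        List<String> tokens = tokenize(text);
        int[] result = new int[tokens.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = codes.getOrDefault(tokens.get(i), -1);
        }
        return result;
    }

    public static void main(String[] args) {
        WordIndexSketch index = new WordIndexSketch();
        index.add("solar flare");
        index.rebuild();
        System.out.println(java.util.Arrays.toString(index.codify("solar flare")));
        index.add("air traffic");
        index.rebuild(); // without this call, "air" and "traffic" would have no code
        System.out.println(java.util.Arrays.toString(index.codify("solar flare air")));
    }
}
```

Once every word has an integer code, comparing two words reduces to comparing two integers, which is the efficiency gain described above.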

References

This section lists the known publications of research work that uses this library. The list is ordered by publication date and numbered sequentially. These numbers are used to cite the publications throughout this document.

[1] Cordeiro, J.P., Dias, G., Brazdil, P. (2007). A Metric for Paraphrase Detection. 2nd International Multi-Conference on Computing in the Global Information Technology. IEEE Computer Society Press. Guadeloupe, France.
[2] Cordeiro, J.P., Dias, G., Brazdil, P. (2007). Learning Paraphrases from WNS Corpora. 20th International FLAIRS Conference. AAAI Press. Key West, Florida, USA.
[3] Cordeiro, J.P., Dias, G., Cleuziou, G. (2007). Biology Based Alignments of Paraphrases for Sentence Compression. In Proceedings of the Workshop on Textual Entailment and Paraphrasing (ACL-PASCAL / ACL2007). Prague, Czech Republic.
[4] Cordeiro, J.P., Dias, G., Cleuziou, G., Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume 2, Issue 4, Pages 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.
[5] Grigonyté, G., Cordeiro, J.P., Moraliyski, R., Dias, G., Brazdil, P. (2010). A Paraphrase Alignment for Synonym Evidence Discovery. 23rd International Conference on Computational Linguistics (COLING 2010). Beijing, China, August 23-27.
[6] Dias, G., Moraliyski, R., Cordeiro, J.P., Doucet, A., Ahonen-Myka, H. (2010). Automatic Discovery of Word Semantic Relations using Paraphrase Alignment and Distributional Lexical Semantics Analysis. In Journal of Natural Language Engineering. Special Issue on Distributional Lexical Semantics. (Guest Eds) Roberto Basili, Marco Pennacchiotti. Volume 16, Issue 04, Pages 439-467. Cambridge University Press. ISSN 1351-3249.
[7] Burrows, S., Potthast, M., Stein, B. (2012). Paraphrase Acquisition via Crowdsourcing and Machine Learning. Transactions on Intelligent Systems and Technology, Vol. V, No. N, January 2012, Pages 1-22. ACM.
