The HultigLib

"Nuggets" for Text Processing in Java

Version 1.0


Description

The HultigLib is a library of text processing tools written in Java. It was designed for efficiency and scalability with respect to the volume of text handled. Large text collections can be processed easily and used in a variety of applications, for example computing the main statistics of a corpus.

The three core classes of this library are Word, Sentence, and Text. Their names indicate what they represent. A Sentence is represented as a linked list of Word elements and, similarly, a Text is represented as a linked list of Sentence elements. A number of sentence similarity functions can be found in the Sentence class. These functions have been used in several research works, such as paraphrase identification in corpora and plagiarism detection. The Text class also provides several handy tools for representing and processing texts. We have also integrated the "OpenNLP" library and implemented an "interface" class for using its most relevant features, such as part-of-speech tagging and shallow and full sentence parsing.
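To make the layering concrete, here is a minimal sketch of a sentence as a linked list of words and a text as a linked list of sentences, with one simple similarity function. It is an illustration only and not the HultigLib API: the class name TextModelSketch, the methods vocabulary, jaccard, and wordCount, the whitespace tokenization, and the choice of Jaccard word overlap as the similarity measure are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: mirrors the Word/Sentence/Text layering described
// above, but these are NOT the HultigLib classes or their real methods.
class TextModelSketch {

    /** A single token. */
    static final class Word {
        final String form;

        Word(String form) {
            this.form = form;
        }
    }

    /** A sentence held as a linked list of Word elements. */
    static final class Sentence {
        final List<Word> words = new LinkedList<>();

        Sentence(String raw) {
            // Naive whitespace tokenization (an assumption made for brevity).
            for (String token : raw.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(new Word(token));
                }
            }
        }

        /** Distinct lower-cased word forms of this sentence. */
        Set<String> vocabulary() {
            Set<String> forms = new HashSet<>();
            for (Word w : words) {
                forms.add(w.form.toLowerCase(Locale.ROOT));
            }
            return forms;
        }

        /** Jaccard word overlap in [0, 1]: |A intersect B| / |A union B|. */
        double jaccard(Sentence other) {
            Set<String> shared = vocabulary();
            Set<String> otherForms = other.vocabulary();
            Set<String> union = new HashSet<>(shared);
            union.addAll(otherForms);
            if (union.isEmpty()) {
                return 0.0; // two empty sentences: define similarity as 0
            }
            shared.retainAll(otherForms); // shared now holds the intersection
            return (double) shared.size() / union.size();
        }
    }

    /** A text held as a linked list of Sentence elements. */
    static final class Text {
        final List<Sentence> sentences = new LinkedList<>();

        void add(Sentence s) {
            sentences.add(s);
        }

        /** Total number of word tokens, a basic corpus statistic. */
        int wordCount() {
            int total = 0;
            for (Sentence s : sentences) {
                total += s.words.size();
            }
            return total;
        }
    }

    public static void main(String[] args) {
        Sentence s1 = new Sentence("the cat sat on the mat");
        Sentence s2 = new Sentence("the cat lay on the mat");

        Text text = new Text();
        text.add(s1);
        text.add(s2);

        System.out.println("words in text: " + text.wordCount());
        System.out.println("jaccard(s1, s2): " + s1.jaccard(s2));
    }
}
```

In this sketch the two example sentences share four of six distinct word forms, so their overlap is 4/6. The similarity functions in the actual Sentence class may be defined quite differently, so the library's own documentation remains the reference for them.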

The source code is available under the terms of the GNU General Public License (GPL) and can be obtained from the Downloads section below. The library's Javadoc documentation is available [here].


Downloads

Several packages are available for download. They are protected by a username and password. To get access to the downloads, please fill in the registration form with your name and e-mail address. The download keys will be sent to your e-mail address. This is only to keep track of the number of users interested in our resources. Thank you.

