The HultigLib: "Nuggets" for Text Processing in Java (Version 1.0)
The HultigLib is a library of text processing tools written in Java. It was designed for efficiency and scalability with respect to the volume of text handled. Large collections of texts can be processed easily and used in a variety of applications, for example computing the main statistics of a corpus.
The three core classes of this library are Word, Sentence, and Text. Their names trivially indicate what they represent. A Sentence is represented through a linked list of Word elements. Similarly, a Text is represented by a linked list of Sentence elements.
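The nesting described above can be pictured with a minimal sketch. The class names SketchWord, SketchSentence and SketchText, and the methods token(), of() and wordCount(), are hypothetical stand-ins chosen for illustration; they are not the HultigLib classes or their API, which may be organised differently.

```java
import java.util.LinkedList;

// Hypothetical stand-ins for illustration only; not the HultigLib classes.
final class SketchWord {
    private final String token;

    SketchWord(String token) {
        this.token = token;
    }

    String token() {
        return token;
    }
}

// A sentence modelled as a linked list of words.
final class SketchSentence extends LinkedList<SketchWord> {
    // Builds a sentence by splitting the raw string on whitespace.
    static SketchSentence of(String raw) {
        SketchSentence sentence = new SketchSentence();
        for (String tok : raw.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                sentence.add(new SketchWord(tok));
            }
        }
        return sentence;
    }
}

// A text modelled as a linked list of sentences.
final class SketchText extends LinkedList<SketchSentence> {
    // Total number of words over all sentences: a simple corpus statistic.
    int wordCount() {
        int total = 0;
        for (SketchSentence sentence : this) {
            total += sentence.size();
        }
        return total;
    }
}

final class LinkedTextDemo {
    public static void main(String[] args) {
        SketchText text = new SketchText();
        text.add(SketchSentence.of("The cat sat"));
        text.add(SketchSentence.of("It purred loudly today"));
        System.out.println(text.size() + " sentences, " + text.wordCount() + " words");
    }
}
```

With this layout, a statistic over a whole text reduces to a nested traversal of the two list levels, as wordCount() does.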
A number of sentence similarity functions can be found in the Sentence class. These functions have been used in several research works, such as paraphrase identification in corpora and plagiarism detection. The Text class also provides several handy tools for representing and processing texts. We have also integrated the "openNLP" library and implemented an "interface" class for using its most relevant features, such as part-of-speech tagging and shallow and full sentence parsing.
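To give an idea of what a sentence similarity function computes, here is a minimal sketch of one common measure, the Jaccard word overlap. It is a generic illustration and is not claimed to be one of the functions implemented in the Sentence class; the class name OverlapSimilarity and the methods tokens() and jaccard() are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of HultigLib.
final class OverlapSimilarity {
    // Lower-cases the sentence and collects its distinct word tokens.
    // Splitting on \W+ treats only ASCII letters, digits and '_' as word characters.
    static Set<String> tokens(String sentence) {
        Set<String> out = new HashSet<>();
        for (String tok : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!tok.isEmpty()) {
                out.add(tok);
            }
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, a value in [0, 1].
    static double jaccard(String a, String b) {
        Set<String> sa = tokens(a);
        Set<String> sb = tokens(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0; // two empty sentences are treated as identical
        }
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }
}

final class OverlapSimilarityDemo {
    public static void main(String[] args) {
        System.out.println(OverlapSimilarity.jaccard("The cat sat.", "the cat slept"));
    }
}
```

In the demo, the two sentences share two of four distinct words, so the score is 0.5. A paraphrase or plagiarism detector built on such a measure would typically flag sentence pairs whose score exceeds a chosen threshold.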
The source code is available under the terms of the GNU General Public License (GPL) and can be obtained from the next section of this document. The library's Java documentation is available [here].
Several packages are available for download. They are protected by a user name and password. To get access to the downloads, please fill in the registration form with your name and e-mail address. The download keys will be sent to your e-mail address. This is only to keep track of the number of users interested in our resources. Thank you.