"Nuggets" for Text Processing in Java
HultigLib is a library that gathers a set of text processing tools written in Java. It was designed for efficiency and scalability with respect to the volume of text handled. Large collections of texts can be processed easily and used in a variety of applications, for example computing the main statistics of a corpus.
The three core classes of this library are Word, Sentence, and Text. Their names trivially indicate what they represent. A Sentence is represented as a linked list of Word elements and, similarly, a Text is represented as a linked list of Sentence elements.
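To make the containment relation concrete, here is a minimal sketch of how such a model could be laid out. It is not the HultigLib source: the class bodies, the whitespace tokenizer, and the names CoreModelSketch, of, and wordCount are assumptions made only for illustration.

```java
import java.util.LinkedList;

// Hypothetical sketch only; these are NOT the actual HultigLib classes.
public class CoreModelSketch {

    /** A single token of a sentence. */
    static final class Word {
        final String form;
        Word(String form) { this.form = form; }
        @Override public String toString() { return form; }
    }

    /** A sentence stored as a linked list of Word elements. */
    static final class Sentence extends LinkedList<Word> {
        /** Assumed helper: naive whitespace tokenization. */
        static Sentence of(String raw) {
            Sentence s = new Sentence();
            for (String tok : raw.trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    s.add(new Word(tok));
                }
            }
            return s;
        }
    }

    /** A text stored as a linked list of Sentence elements. */
    static final class Text extends LinkedList<Sentence> {
        /** Assumed helper: total number of words over all sentences. */
        int wordCount() {
            int n = 0;
            for (Sentence s : this) {
                n += s.size();
            }
            return n;
        }
    }

    public static void main(String[] args) {
        Text text = new Text();
        text.add(Sentence.of("The cat sat on the mat"));
        text.add(Sentence.of("It purred"));
        System.out.println(text.size() + " sentences, " + text.wordCount() + " words");
    }
}
```

A linked list keeps appending cheap while a large collection is being read sequentially, which fits the streaming style of corpus processing described above. Random access by index is slower than with an array-backed list.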
A number of sentence similarity functions can be found in the Sentence class. These functions have been used in several research works, such as paraphrase identification in corpora and plagiarism detection. The Text class also provides several handy tools for representing and processing texts. We have also integrated the "openNLP" library and implemented an "interface" class for using its most relevant features, such as part-of-speech tagging and shallow and full sentence parsing.
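As an illustration of what a sentence similarity function computes, the sketch below implements plain Jaccard word overlap, a common baseline in paraphrase and plagiarism detection. It is not one of the library's own functions, whose definitions are not given here. The class name OverlapSimilaritySketch and the methods wordSet and jaccard are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: Jaccard word overlap, NOT a HultigLib function.
public class OverlapSimilaritySketch {

    /** Lower-cases the sentence and returns its set of distinct words. */
    static Set<String> wordSet(String sentence) {
        Set<String> words = new HashSet<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String tok : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) {
                words.add(tok);
            }
        }
        return words;
    }

    /** Jaccard similarity: |A intersect B| / |A union B|, in the range [0, 1]. */
    static double jaccard(String a, String b) {
        Set<String> wa = wordSet(a);
        Set<String> wb = wordSet(b);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        if (union.isEmpty()) {
            return 0.0; // convention chosen here for two empty sentences
        }
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String s1 = "The cat sat on the mat";
        String s2 = "The cat lay on the mat";
        System.out.println("similarity = " + jaccard(s1, s2));
    }
}
```

In the example the two sentences share four of six distinct words, so the score is high even though only one word differs. Paraphrase detection typically needs measures that go beyond this kind of raw lexical overlap.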
Several packages are available for download. They are protected by a user name and password. To get access to the downloads, please fill in the registration form with your name and e-mail address. The download keys will be sent to your e-mail. This is only to keep track of the number of users interested in our resources. Thank you.