Serialized Form
br
BufferedReader br
encode
String encode
dictionary
CorpusIndex dictionary
- The corpus index reference for this class.
VCLUSTERS
ArrayList<E> VCLUSTERS
- The list of news clusters loaded.
writer
PrintWriter writer
encode
String encode
base
String base
- File base name. For example, for the file "fxyz.dat", "fxyz" is
its base name and ".dat" its extension token.
ext
String ext
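The base/extension split described for these two fields can be sketched as follows. This is only an illustration: the class name `FileNameParts` and the method `split` are hypothetical, and treating a name with no dot (or only a leading dot) as having an empty extension is an assumption, not something the original states.

```java
// Hypothetical helper illustrating the "base" and "ext" fields.
final class FileNameParts {
    final String base; // e.g. "fxyz"
    final String ext;  // e.g. ".dat" (the dot is part of the extension token)

    private FileNameParts(String base, String ext) {
        this.base = base;
        this.ext = ext;
    }

    // Splits at the last dot; assumes names without a usable dot have no extension.
    static FileNameParts split(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0) {
            return new FileNameParts(fileName, "");
        }
        return new FileNameParts(fileName.substring(0, dot), fileName.substring(dot));
    }
}
```

With this sketch, `FileNameParts.split("fxyz.dat")` yields the base "fxyz" and the extension ".dat", matching the example in the field description.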
base
String base
- File base name. For example, for the file "fxyz.dat", "fxyz" is
its base name and ".dat" its extension token.
ext
String ext
CHUNK_VALUE
double[] CHUNK_VALUE
- Stores a numerical value for each chunk type. It was conceived to compute
sentence proximity based on the proximity of their chunks. The idea is to
weight chunk types differently, for example giving more value to
NP and VP chunks.
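One way such per-type weights might feed a sentence proximity score is sketched below. Everything here is an assumption made for illustration: the `Type` enum, the concrete weight values, and the scoring rule (the weighted share of one sentence's chunks that also occur in the other) are hypothetical and are not taken from the library.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of chunk-weighted sentence proximity.
final class ChunkWeightSketch {
    // Assumed chunk types; the ordinal indexes into CHUNK_VALUE.
    enum Type { NP, VP, PP, OTHER }

    // Assumed weights: NP and VP chunks count more than the others.
    static final double[] CHUNK_VALUE = { 1.0, 1.0, 0.5, 0.25 };

    // Weighted fraction of the chunks of sentence A that also appear
    // (same type and same text) in sentence B. Returns 0 for an empty A.
    static double proximity(Type[] typesA, String[] textsA,
                            Type[] typesB, String[] textsB) {
        Set<String> inB = new HashSet<>();
        for (int i = 0; i < typesB.length; i++) {
            inB.add(typesB[i] + "|" + textsB[i]);
        }
        double shared = 0.0;
        double total = 0.0;
        for (int i = 0; i < typesA.length; i++) {
            double w = CHUNK_VALUE[typesA[i].ordinal()];
            total += w;
            if (inB.contains(typesA[i] + "|" + textsA[i])) {
                shared += w;
            }
        }
        return total == 0.0 ? 0.0 : shared / total;
    }
}
```

Because the score is normalized by the total weight of sentence A, a shared NP chunk moves the result more than a shared PP chunk does, which is the effect the field description aims for.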
vcmark
ChunkMark[] vcmark
- The array of chunk marks defining the sentence chunk boundaries and types.
sdict
TreeMap<K,V> sdict
- A corpus index with the words/tokens as
keys. Given a word, we can obtain its
numeric index.
idict
TreeMap<K,V> idict
- A corpus index with the numeric index as
the key. Given a numeric index, we can get
the corresponding word.
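The pairing of a word-keyed map with an index-keyed map can be sketched as below. The key and value types (`String` and `Integer`), the class name `WordIndexSketch`, and the policy of assigning indexes in insertion order starting at 0 are assumptions; the serialized form only shows `TreeMap<K,V>`.

```java
import java.util.TreeMap;

// Hypothetical two-way word index built from two TreeMaps.
final class WordIndexSketch {
    private final TreeMap<String, Integer> sdict = new TreeMap<>(); // word -> index
    private final TreeMap<Integer, String> idict = new TreeMap<>(); // index -> word

    // Adds a word if unseen and returns its numeric index.
    int add(String word) {
        Integer known = sdict.get(word);
        if (known != null) {
            return known;
        }
        int index = sdict.size(); // assumed policy: next free index
        sdict.put(word, index);
        idict.put(index, word);
        return index;
    }

    // Returns the index of a word, or -1 if the word is unknown.
    int indexOf(String word) {
        Integer i = sdict.get(word);
        return i == null ? -1 : i;
    }

    // Returns the word stored at an index, or null if there is none.
    String wordAt(int index) {
        return idict.get(index);
    }
}
```

Keeping both maps in step on every insertion is what makes lookups cheap in either direction, at the cost of storing each word twice.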
hstab
Hashtable<K,V> hstab
- A hash table for counting word frequencies in the corpora.
TRUNCV
int TRUNCV
- Size of word truncation. If this value is greater than
zero, tokens read from the corpora are truncated and
stored with a maximum length of TRUNCV.
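The truncation rule can be illustrated with a minimal sketch. The class and method names are hypothetical; only the rule itself (truncate when the setting is greater than zero) comes from the field description.

```java
// Hypothetical helper showing the TRUNCV rule on a single token.
final class TruncateSketch {
    // Returns the token cut to at most truncv characters when truncv > 0;
    // otherwise returns the token unchanged.
    static String truncate(String token, int truncv) {
        if (truncv > 0 && token.length() > truncv) {
            return token.substring(0, truncv);
        }
        return token;
    }
}
```

For instance, with a truncation size of 5 the token "clustering" would be stored as "clust", while a size of 0 leaves it untouched.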
ENCODE
String ENCODE
- The text encoding string used to read the text corpora,
for example UTF-8 or ISO-8859-1.
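An encoding string of this kind is typically passed to the reader that decodes the corpus bytes. The sketch below uses only standard `java.io` and `java.nio.charset` classes; the class name `EncodedReaderSketch` and its methods are hypothetical, and how the library itself opens its corpora is not shown in the original.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper: decode a byte stream using a named encoding.
final class EncodedReaderSketch {
    // Wraps the stream in a reader that decodes with the given charset name.
    static BufferedReader open(InputStream in, String encode) {
        return new BufferedReader(new InputStreamReader(in, Charset.forName(encode)));
    }

    // Reads the first line of the given bytes, decoded with the named encoding.
    static String firstLine(byte[] bytes, String encode) {
        try (BufferedReader br = open(new ByteArrayInputStream(bytes), encode)) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same bytes decode to different text under different encodings, which is why the encoding has to match the one the corpus files were written in.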
N
int N
- The n-gram dimensionality: 2-gram, 3-gram, and so on. The
default is a 2-gram, also known as a bigram.
soma
long soma
- The sum of frequencies - the number of processed tokens.
hsubngram
Hashtable<K,V> hsubngram
- The n-gram table, holding the frequency of each n-gram
in the processed corpora.
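The relationship between the n-gram size, the token total, and the n-gram frequency table can be sketched as follows. The table's key and value types (`String` and `Integer`), the space-joined key format, and the class name `NGramCountSketch` are assumptions made for illustration; the serialized form only shows `Hashtable<K,V>`.

```java
import java.util.Arrays;
import java.util.Hashtable;

// Hypothetical n-gram counter mirroring the N, soma and hsubngram fields.
final class NGramCountSketch {
    final int n;       // n-gram dimensionality, e.g. 2 for bigrams
    long soma = 0;     // number of processed tokens
    final Hashtable<String, Integer> hsubngram = new Hashtable<>();

    NGramCountSketch(int n) {
        this.n = n;
    }

    // Counts every window of n consecutive tokens and updates the token total.
    void process(String[] tokens) {
        soma += tokens.length;
        for (int i = 0; i + n <= tokens.length; i++) {
            // Assumed key format: the n tokens joined by single spaces.
            String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            hsubngram.merge(key, 1, Integer::sum);
        }
    }

    // Returns how often an n-gram was seen, or 0 if it never occurred.
    int frequency(String ngram) {
        return hsubngram.getOrDefault(ngram, 0);
    }
}
```

A sequence of k tokens contributes k to the token total but only k - n + 1 windows to the table, so the two counts differ slightly for each processed sequence.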
dictionary
CorpusIndex dictionary
ran
Random ran
MODE
int MODE
- Holds the sorting criteria.
STYPE
RuleList.SortType STYPE
stx
String stx
- Internal string representation of this sentence.
label
String label
- This label defines a sentence meta-tag.
metric
hultig.sumo.Sentence.Metric metric
cod
int cod
- A sentence index, used in news clustering.
- Since: 2008-06-05
CINDEX
CorpusIndex CINDEX
- The corpus index used for this text.
VOCAB
HashMap<K,V> VOCAB
- Dynamically stores the vocabulary of this text.
NUMTOKENS
int NUMTOKENS
- The total number of tokens in this text.
serialVersionUID: -5223039887894735826L
word
String word
META
Vector<E> META
cods
int[] cods
- Introduced later, in June 2008. The idea is to use several codes
representing different kinds of tags: lexical, syntactic, and
possibly others. So far the first three positions store,
respectively, the lexical, POS, and chunker codes.
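The slot layout described for this array can be made explicit with named positions. In the sketch below the constant names, the `-1` "unset" marker, and the class name `WordCodesSketch` are hypothetical; only the ordering (lexical, POS, chunker) comes from the field description.

```java
import java.util.Arrays;

// Hypothetical view of the cods array with named slots.
final class WordCodesSketch {
    static final int LEX = 0;    // lexical code
    static final int POS = 1;    // part-of-speech code
    static final int CHUNK = 2;  // chunker code
    static final int UNSET = -1; // assumed marker for "no code assigned"

    final int[] cods = new int[3];

    WordCodesSketch() {
        Arrays.fill(cods, UNSET);
    }

    // Stores a code in the given slot.
    void set(int slot, int code) {
        cods[slot] = code;
    }

    // Returns the code in the given slot, or UNSET if none was stored.
    int get(int slot) {
        return cods[slot];
    }
}
```

Naming the positions keeps call sites readable and leaves room for further tag kinds to be appended without disturbing the first three slots.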
POS
char[] POS
- Holds the part-of-speech tag of this word.
Introduced on 2007/11/11, but now obsolete due to
the cods array, added later to this class.
CHTAG
ChunkTag CHTAG
FREQ
long FREQ
serialVersionUID: -5798479126800064641L
WL
Word[] WL
WR
Word[] WR
WX
Word[][] WX
POST
POSType POST
rand
Random rand
serialVersionUID: -2118567303945736768L
FORMAT
int FORMAT
postype
POSType postype
M
int M
ENCODE
String ENCODE