hultig.sumo
Class TxtFilter

java.lang.Object
  extended by hultig.sumo.TxtFilter

public class TxtFilter
extends Object

NOT YET WELL COMMENTED.


Field Summary
 int MINLEN
          The string minimum length.
 int MINWORDS
          The minimum number of words allowed, by line.
 byte NUMTAG
          If greater than 0, activates number tagging, i.e., every number will be replaced by the tag.
 int numtokens
          The number of tokens found in the last string processed (method procWords/1).
 int numUpperWords
           
 int[] NUMWORDS
          Number of words with 1, 2, ..., 10 letters. NUMWORDS[0] will contain the total number of words.
 byte OFF
          A boolean false value defined.
 byte ON
          A boolean true value defined.
 
Constructor Summary
TxtFilter()
          Default constructor.
 
Method Summary
 String filtering(String line)
          Filters an input string according to a set of rules.
 double getPercUpperWords()
           
 int[] getWordHistogram()
           
static boolean isLetter(byte c)
           
static void main(String[] args)
          The program entry point.
 double probBeText()
           
 double probBeText(String s)
           
 double probGoodText(String s)
          Estimates the probability that s is "good text".
 boolean process(String file)
          A shortcut for the process/2 method.
 boolean process(String file, String fout)
          Process a file by applying a set of filtering rules.
 boolean procstr(String s)
          Processes a string and sets up a number of state variables, such as the number of characters, the number of upper- and lower-case characters, the probability of being "good text", etc.
 boolean procWords(String s)
           
 boolean satisfySpecialRules(String sx)
          Verifies whether a given string satisfies a number of text rules.
 boolean satisfyWordConstraints(double pwords, int[] reqhistogram)
          Tests a set of constraints.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ON

public byte ON
A boolean true value defined.


OFF

public byte OFF
A boolean false value defined.


NUMTAG

public byte NUMTAG
If greater than 0, activates number tagging, i.e., every number will be replaced by the tag.
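
The following is a minimal sketch of what number tagging could look like, not the TxtFilter implementation. The class name NumberTagSketch, the tag text "<NUM>", and the regular expression used to recognise a number are all assumptions made for illustration; this page does not say which tag TxtFilter emits or which number formats it recognises.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberTagSketch {
    // Hypothetical tag text; the tag actually used by TxtFilter is not documented here.
    static final String TAG = "<NUM>";

    // Assumed number format: digits, optionally followed by groups such as ".50" or ",000".
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)*");

    // Replaces every number in the line with TAG when numTag is greater than 0;
    // otherwise returns the line unchanged.
    static String tagNumbers(String line, byte numTag) {
        if (numTag <= 0) {
            return line;
        }
        return NUMBER.matcher(line).replaceAll(Matcher.quoteReplacement(TAG));
    }
}
```

Under these assumptions, tagNumbers("In 2008 we paid 3.50 euros", (byte) 1) yields "In <NUM> we paid <NUM> euros".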


MINLEN

public int MINLEN
The string minimum length.


MINWORDS

public int MINWORDS
The minimum number of words allowed, by line.
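
As an illustration of how MINLEN and MINWORDS might act together as a per-line threshold, here is a small sketch. It is an assumption that both limits are checked on the trimmed line and that words are separated by whitespace; the class and method names are hypothetical and are not part of TxtFilter.

```java
final class LineLengthSketch {
    // Returns true when the trimmed line has at least minLen characters
    // and at least minWords whitespace-separated words.
    static boolean longEnough(String line, int minLen, int minWords) {
        String trimmed = line.trim();
        if (trimmed.length() < minLen) {
            return false;
        }
        // An empty line has zero words; split would otherwise return one empty token.
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return words >= minWords;
    }
}
```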


NUMWORDS

public int[] NUMWORDS
Number of words with 1, 2, ..., 10 letters. NUMWORDS[0] will contain the total number of words.
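
The layout described above (index 0 for the total, indices 1 to 10 for words of that many letters) can be sketched as follows. This is an illustration only: the class name is hypothetical, and it is an assumption that only letters count towards a word's length and that words longer than 10 letters are counted in the total alone.

```java
final class WordHistogramSketch {
    // Returns 11 counters: index 0 is the total number of words,
    // index n (1..10) is the number of words with exactly n letters.
    static int[] histogram(String s) {
        int[] counts = new int[11];
        for (String token : s.trim().split("\\s+")) {
            // Count letters only, so punctuation such as a trailing period is ignored.
            int letters = 0;
            for (int i = 0; i < token.length(); i++) {
                if (Character.isLetter(token.charAt(i))) {
                    letters++;
                }
            }
            if (letters == 0) {
                continue; // empty token, number or punctuation: not counted as a word
            }
            counts[0]++;
            if (letters <= 10) {
                counts[letters]++; // longer words appear in the total only (assumption)
            }
        }
        return counts;
    }
}
```

For the input "The cat sat on a mat." this sketch returns a total of 6 in index 0, with 1 one-letter word, 1 two-letter word and 4 three-letter words.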


numtokens

public int numtokens
The number of tokens found in the last string processed (method procWords/1).


numUpperWords

public int numUpperWords

Constructor Detail

TxtFilter

public TxtFilter()
Default constructor.

Method Detail

isLetter

public static boolean isLetter(byte c)

procstr

public boolean procstr(String s)
Processes a string and sets up a number of state variables, such as the number of characters, the number of upper- and lower-case characters, the probability of being "good text", etc.

Parameters:
s - The string to be processed.
Returns:
true if successful and false otherwise.

procWords

public boolean procWords(String s)

probBeText

public double probBeText()

probBeText

public double probBeText(String s)

getWordHistogram

public int[] getWordHistogram()

getPercUpperWords

public double getPercUpperWords()

satisfyWordConstraints

public boolean satisfyWordConstraints(double pwords,
                                      int[] reqhistogram)
Tests a set of constraints.

Parameters:
pwords - The minimum word percentage.
reqhistogram - Requested word histogram satisfaction.
Returns:
true if all conditions are satisfied.

probGoodText

public double probGoodText(String s)
Estimates the probability that s is "good text". By "good text" we mean text that contains a message written in a natural Western language, such as English.

Parameters:
s - String
Returns:
A probability value in the interval [0,1]
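
This page does not describe how the estimate is computed. Purely to illustrate the idea of scoring a string in [0,1], the sketch below uses a crude proxy: the fraction of characters that are letters or whitespace. It is not the estimator used by TxtFilter and is not a calibrated probability; the class and method names are hypothetical.

```java
final class GoodTextSketch {
    // Returns the fraction of characters in s that are letters or whitespace.
    // The result lies in [0,1]; a null or empty string scores 0.
    static double score(String s) {
        if (s == null || s.isEmpty()) {
            return 0.0;
        }
        int good = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c) || Character.isWhitespace(c)) {
                good++;
            }
        }
        return (double) good / s.length();
    }
}
```

With this proxy, an ordinary sentence scores close to 1, while a line made mostly of digits and symbols scores close to 0.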

satisfySpecialRules

public boolean satisfySpecialRules(String sx)
Verifies whether a given string satisfies a number of text rules.

Parameters:
sx - The input string, i.e., the string to be tested.
Returns:
true if successful and false otherwise.

process

public boolean process(String file)
A shortcut for the process/2 method.

Parameters:
file - The file name to be processed.
Returns:
true if no problems occurred during processing, false otherwise.

process

public boolean process(String file,
                       String fout)
Process a file by applying a set of filtering rules.

Parameters:
file - The input file name.
fout - The output file name; if it is null, the output is written to standard output.
Returns:
true if no problems occurred during processing, false otherwise.
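
The contract described above (line-by-line filtering, a null fout meaning standard output, a boolean result) can be sketched as follows. This is not the TxtFilter source: the class name, the extra filter parameter, the UTF-8 encoding and the convention that a null result drops the line are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.UnaryOperator;

final class ProcessSketch {
    // Reads `file` line by line, applies `filter`, and writes each kept line
    // to `fout`, or to standard output when `fout` is null.
    // A null result from `filter` means the line is dropped (an assumption).
    static boolean process(String file, String fout, UnaryOperator<String> filter) {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
            PrintStream out = (fout == null) ? System.out : new PrintStream(fout, "UTF-8");
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String kept = filter.apply(line);
                    if (kept != null) {
                        out.println(kept);
                    }
                }
                // PrintStream hides write errors; checkError() flushes and reports them.
                return !out.checkError();
            } finally {
                if (fout != null) {
                    out.close(); // never close System.out
                }
            }
        } catch (IOException e) {
            // Missing input file, unwritable output file, or read failure.
            return false;
        }
    }
}
```

A one-argument shortcut in the style of process/1 would then simply delegate with a null output name, for example process(file, null, filter).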

filtering

public String filtering(String line)
Filters an input string according to a set of rules.

Parameters:
line - String
Returns:
The input string after filtering.

main

public static void main(String[] args)
The program entry point.

Parameters:
args - The array with the input arguments