hultig.io
Class FileNewsCluster

java.lang.Object
  extended by java.io.File
      extended by hultig.io.FileNewsCluster
All Implemented Interfaces:
Serializable, Comparable<File>

public class FileNewsCluster
extends File

This class handles web news files, which are XML data files containing news stories extracted from the web. The stories are stored in clusters of related stories. The general structure of such a file is illustrated below:

    <news-clusters>
       <cluster i="1" url="http://news.google.com/...">
          <new i="1" url="...">
             Wall Street stocks began the final week of one of their worst
             years...
          </new>
          ...
       </cluster>
       ...
       ...
    </news-clusters>

 
Each cluster is identified sequentially and contains the URL of its source, as well as its news stories (the <new> elements).
(18:22:07, 16, February, 2009)
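
As an illustration of the file layout described above, the following minimal sketch collects the story texts of each cluster. It is not this class's implementation: it uses the standard JAXP DOM parser, whereas readCluster suggests the class reads the file through a BufferedReader. The class name NewsClusterXmlSketch, the method parse, and the example.org URLs are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of hultig.io.
class NewsClusterXmlSketch {

    /** Returns, for each <cluster> element, the trimmed text of its <new> stories. */
    static List<List<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<List<String>> clusters = new ArrayList<>();
            NodeList clusterNodes = doc.getElementsByTagName("cluster");
            for (int c = 0; c < clusterNodes.getLength(); c++) {
                Element cluster = (Element) clusterNodes.item(c);
                List<String> stories = new ArrayList<>();
                NodeList newsNodes = cluster.getElementsByTagName("new");
                for (int n = 0; n < newsNodes.getLength(); n++) {
                    stories.add(newsNodes.item(n).getTextContent().trim());
                }
                clusters.add(stories);
            }
            return clusters;
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse news clusters", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<news-clusters>"
                + "<cluster i=\"1\" url=\"http://example.org/c1\">"
                + "<new i=\"1\" url=\"http://example.org/n1\">First story.</new>"
                + "<new i=\"2\" url=\"http://example.org/n2\">Second story.</new>"
                + "</cluster>"
                + "</news-clusters>";
        System.out.println(parse(xml)); // prints [[First story., Second story.]]
    }
}
```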

See Also:
Serialized Form

Field Summary
 
Fields inherited from class java.io.File
pathSeparator, pathSeparatorChar, separator, separatorChar
 
Constructor Summary
FileNewsCluster(String fpath)
          Creates a FileNewsCluster for the news file at the given path.
 
Method Summary
static String cleanSentence(String s)
          Cleans a sentence string from certain extra/meta symbols, like HTML/XML tags.
 CorpusIndex getDictionary()
          Gives the reference to the corpus index used in this object.
 NewsCluster getNewsCluster(int index)
           
 ArrayList<NewsCluster> getNewsClusters()
          Gives the list of news clusters in this object.
 Sentence[] getNewsClusterSentences(int index)
          Gives the set of sentences contained in the i-th news cluster, from this object.
 int getNumClusters()
          Gives the number of clusters of web news stories loaded.
 Sentence[] loadAllSentences()
           
 boolean loadClusters()
          Loads news clusters contained in a given file.
static void main(String[] args)
          Demonstrates the main operations of this class, including loading and manipulating web news stories.
 boolean passfilter(String line)
          Defines a filter to be applied to the text, so that certain exotic or uninteresting strings are rejected, for example lines with fewer than 5 characters or sentences with fewer than three words.
 boolean readCluster(BufferedReader br, NewsCluster cluster)
          Reads a news cluster from the given file reader (BufferedReader).
 
Methods inherited from class java.io.File
canExecute, canRead, canWrite, compareTo, createNewFile, createTempFile, createTempFile, delete, deleteOnExit, equals, exists, getAbsoluteFile, getAbsolutePath, getCanonicalFile, getCanonicalPath, getFreeSpace, getName, getParent, getParentFile, getPath, getTotalSpace, getUsableSpace, hashCode, isAbsolute, isDirectory, isFile, isHidden, lastModified, length, list, list, listFiles, listFiles, listFiles, listRoots, mkdir, mkdirs, renameTo, setExecutable, setExecutable, setLastModified, setReadable, setReadable, setReadOnly, setWritable, setWritable, toString, toURI, toURL
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

FileNewsCluster

public FileNewsCluster(String fpath)
Creates a FileNewsCluster for the news file at the given path.

Method Detail

getNewsClusters

public ArrayList<NewsCluster> getNewsClusters()
Gives the list of news clusters in this object. Each news cluster contains a list of related news stories.

Returns:
The list of news clusters.

getNewsCluster

public NewsCluster getNewsCluster(int index)

getNewsClusterSentences

public Sentence[] getNewsClusterSentences(int index)
Gives the set of sentences contained in the i-th news cluster, from this object.

Parameters:
index - The position i of the news cluster.
Returns:
An array with all sentences from a given cluster.

loadAllSentences

public Sentence[] loadAllSentences()

getDictionary

public CorpusIndex getDictionary()
Gives the reference to the corpus index used in this object.

Returns:
A corpus index reference.

getNumClusters

public int getNumClusters()
Gives the number of clusters of web news stories loaded.

Returns:
The number of clusters loaded.

loadClusters

public boolean loadClusters()
Loads news clusters contained in a given file. The loaded clusters are stored in VCLUSTERS.

Returns:
true if the loading process succeeds, false otherwise.

readCluster

public boolean readCluster(BufferedReader br,
                           NewsCluster cluster)
                    throws Exception
Reads a news cluster from the given file reader (BufferedReader).

Parameters:
br - The file reader from which the news cluster should be read.
cluster - An output parameter that receives the news cluster read.
Returns:
true if the loading process succeeds, false otherwise.
Throws:
Exception

passfilter

public boolean passfilter(String line)
Defines a filter to be applied to the text, so that certain exotic or uninteresting strings are rejected, for example lines with fewer than 5 characters or sentences with fewer than three words.

Parameters:
line - The string to be tested.
Returns:
true if the input string passes the test, false otherwise.
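
The two rejection rules given as examples above can be sketched as follows. This is only an assumption about how such a filter might look; the actual passfilter may apply further rules and may count characters or words differently. The class name PassFilterSketch and the whitespace-based word count are hypothetical.

```java
// Hypothetical sketch, not the hultig.io implementation.
class PassFilterSketch {

    /** Rejects lines with fewer than 5 characters or fewer than three words. */
    static boolean passFilter(String line) {
        if (line == null) {
            return false;
        }
        String text = line.trim();
        if (text.length() < 5) {
            return false; // too short to be a useful sentence
        }
        // Assumption: words are maximal runs of non-whitespace characters.
        return text.split("\\s+").length >= 3;
    }

    public static void main(String[] args) {
        System.out.println(passFilter("ok"));                        // false
        System.out.println(passFilter("Hello world"));               // false
        System.out.println(passFilter("Stocks fell sharply today")); // true
    }
}
```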

cleanSentence

public static String cleanSentence(String s)
Cleans a sentence string from certain extra/meta symbols, like HTML/XML tags.

Parameters:
s - The input sentence string.
Returns:
The cleaned sentence string. (8, August, 2008)
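
For illustration, tag stripping of the kind described above could be approximated with regular expressions, as in the minimal sketch below. Which symbols cleanSentence actually removes is not documented here, so the patterns, the class name CleanSentenceSketch, and the method name clean are assumptions.

```java
// Hypothetical sketch, not the hultig.io implementation.
class CleanSentenceSketch {

    /** Replaces anything that looks like an HTML/XML tag with a space and normalizes whitespace. */
    static String clean(String s) {
        if (s == null) {
            return "";
        }
        return s.replaceAll("<[^>]*>", " ") // drop tags such as <b> or <br/>
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        // prints "Wall Street stocks fell."
        System.out.println(clean("<b>Wall Street</b> stocks <br/>fell."));
    }
}
```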

main

public static void main(String[] args)
Demonstrates the main operations of this class, including loading and manipulating web news stories.

Parameters:
args - No parameters are expected.