[Kauchak & Barzilay 2006] - Paraphrasing for Automatic Evaluation

This paper studies the impact of paraphrases on the accuracy of automatic evaluation. Given a reference sentence and a machine-generated sentence, we seek to find a paraphrase of the reference sentence that is closer in wording to the machine output than the original reference. We apply our paraphrasing method in the context of machine translation evaluation. Our experiments show that the use of a paraphrased synthetic reference refines the accuracy of automatic evaluation. We also found a strong connection between the quality of automatic paraphrases as judged by humans and their contribution to automatic evaluation. [PDF]

[Dolan et al. 2004] - Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources

We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly-extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships. [PDF]
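
The first technique in the abstract above, pairing sentences by string edit distance, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the word-level tokenization, the function names `word_edit_distance` and `candidate_pairs`, and the `max_distance` threshold are all assumptions made for the example.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sentences, counted in word tokens."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def candidate_pairs(sentences, max_distance=3):
    """Return pairs that differ, but by at most max_distance word edits.

    max_distance is a hypothetical threshold chosen for illustration.
    """
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            d = word_edit_distance(sentences[i], sentences[j])
            if 0 < d <= max_distance:
                pairs.append((sentences[i], sentences[j], d))
    return pairs
```

A low threshold favors near-identical sentence pairs, which is consistent with the abstract's observation that edit distance data is clean but lacks complex alternations.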

[Zajic et al. 2004] - BBN/UMD at DUC-2004: Topiary

Abstract: This paper reports our results at DUC-2004 and describes our approach, implemented in a system called Topiary. We will show that the combination of linguistically motivated sentence compression with statistically selected topic terms performs better than either alone, according to some automatic summary evaluation measures. [PDF]

[Vandeghinste & Pan 2004] - Sentence Compression for Automated Subtitling: A Hybrid Approach

Abstract: In this paper a sentence compression tool is described. We describe how an input sentence gets analysed using, among other components, a tagger, a shallow parser and a subordinate clause detector, and how, based on this analysis, several compressed versions of this sentence are generated, each with an associated estimated probability. These probabilities were estimated from a parallel transcript/subtitle corpus. To avoid ungrammatical sentences, the tool also makes use of a number of rules. The evaluation was done on three different pronunciation speeds, averaging sentence reduction rates of 40% to 17%. The number of reasonable reductions ranges between 32.9% and 51%, depending on the average estimated pronunciation speed. [PDF]

[Nguyen et al. 2004] - Example-Based Sentence Reduction Using the Hidden Markov Model

Abstract: Sentence reduction is the removal of redundant words or phrases from an input sentence by creating a new sentence in which the gist of the original meaning of the sentence remains unchanged. All previous methods required a syntax parser before sentences could be reduced; hence it was difficult to apply them to a language with no reliable parser. In this article we propose two new sentence-reduction algorithms that do not use syntactic parsing for the input sentence. The first algorithm, based on the template-translation learning algorithm, one of the example-based machine-translation methods, works quite well in reducing sentences, but its computational complexity can be exponential in certain cases. The second algorithm, an extension of the template-translation algorithm via innovative employment of the Hidden Markov model, which uses the set of template rules learned from examples, can overcome this computation problem. Experiments show that the proposed algorithms achieve acceptable results in comparison to sentence reduction done by humans. [PDF]

[Mallett et al. 2004] - Information-Content Based Sentence Extraction for Text Summarization

Abstract: This paper proposes the FULL-COVERAGE summarizer: an efficient, information retrieval oriented method to extract non-redundant sentences from text for summarization purposes. Our method leverages existing Information Retrieval technology by extracting key-sentences on the premise that the relevance of a sentence is proportional to its similarity to the whole document. We show that our method can produce sentence-based summaries that are up to 78% smaller than the original text with only 3% loss in retrieval performance. [PDF]
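
The premise stated in the abstract above, that a sentence's relevance is proportional to its similarity to the whole document, can be illustrated with a small sketch. It assumes raw term-frequency vectors and cosine similarity, which the abstract does not specify, and it leaves out the redundancy removal that the FULL-COVERAGE method performs. The names `tokens`, `cosine`, and `rank_sentences` are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_sentences(sentences):
    """Score each sentence against the whole document, best first."""
    doc = Counter(t for s in sentences for t in tokens(s))
    scored = [(cosine(Counter(tokens(s)), doc), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A summary would then be built from the top-ranked sentences; how many to keep, and how to skip sentences that repeat already-covered content, is left open here.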

[Harman & Over 2004] - The Effects of Human Variation in DUC Summarization Evaluation

There is a long history of research in automatic text summarization systems by both the text retrieval and the natural language processing communities, but evaluation of such systems’ output has always presented problems. One critical problem remains how to handle the unavoidable variability in human judgments at the core of all the evaluations. Sponsored by the DARPA TIDES project, NIST launched a new text summarization evaluation effort, called DUC, in 2001 with follow-on workshops in 2002 and 2003. Human judgments provided the foundation for all three evaluations and this paper examines how the variation in those judgments does and does not affect the results and their interpretation. [PDF]

[Daelemans et al. 2004] - Automatic Sentence Simplification for Subtitling in Dutch and English

We describe ongoing work on sentence summarization in the European MUSA project and the Flemish ATraNoS project. Both projects aim at automatic generation of TV subtitles for hearing-impaired people. This involves speech recognition, a topic which is not covered in this paper, and summarizing sentences in such a way that they fit in the available space for subtitles. The target language is equal to the source language: Dutch in ATraNoS and English in MUSA. A separate part of MUSA deals with translating the English subtitles to French and Greek. We compare two methods for monolingual sentence length reduction: one based on learning sentence reduction from a parallel corpus and one based on hand-crafted deletion rules. [PDF]

[Barzilay & Lee 2003] - Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

We address the text-to-text generation problem of sentence-level paraphrasing - a phenomenon distinct from and more difficult than word- or phrase-level paraphrasing. Our approach applies multiple-sequence alignment to sentences gathered from unannotated comparable corpora: it learns a set of paraphrasing patterns represented by word lattice pairs and automatically determines how to apply these patterns to rewrite new sentences. The results of our evaluation experiments show that the system derives accurate paraphrases, outperforming baseline systems. [PDF]   [slides]

[Silber & McCoy 2003] - Efficiently Computed Lexical Chains As an Intermediate Representation for Automatic Text Summarization

While automatic text summarization is an area that has received a great deal of attention in recent research, the problem of efficiency in this task has not been frequently addressed. When considering the size and quantity of documents available on the Internet and from other sources, the need for a highly efficient tool that produces usable summaries is clear. We present a linear time algorithm for lexical chain computation. The algorithm makes lexical chains a computationally feasible candidate as an intermediate representation for automatic text summarization. A method for evaluating lexical chains as an intermediate step in summarization is also presented and carried out. Such an evaluation was not possible before due to the computational complexity of previous lexical chains algorithms. [PDF]

[Rino & Prado 2003] - A Sumarização Automática de Textos: Principais Características e Metodologias (Automatic Text Summarization: Main Characteristics and Methodologies)

Automatic Summarization aims at simulating the main features of human summarizing, namely: identifying relevant text segments and putting them together into the corresponding summaries. Summaries, in this context, are simply condensed versions of a source text. There are diverse Automatic Summarization models, which use either linguistic knowledge or statistical or empirical information. The composition of relevant information depends on the modeling process: it may correspond to a full rewriting of the summary, similar to what humans do, or it may be just a simple selection of segments and their literal reproduction as juxtaposed summary units. In this chapter, we present such diversity, illustrating both approaches by describing some automatic summarizers that have been developed at NILC. [PDF]

[Riezler et al. 2003] - Statistical Sentence Condensation using Ambiguity Packing and Stochastic Disambiguation Methods for Lexical-Functional Grammar

We present an application of ambiguity packing and stochastic disambiguation techniques for Lexical-Functional Grammars (LFG) to the domain of sentence condensation. Our system incorporates a linguistic parser/generator for LFG, a transfer component for parse reduction operating on packed parse forests, and a maximum-entropy model for stochastic output selection. Furthermore, we propose the use of standard parser evaluation methods for automatically evaluating the summarization quality of sentence condensation systems. An experimental evaluation of summarization quality shows a close correlation between the automatic parse-based evaluation and a manual evaluation of generated strings. Overall summarization quality of the proposed system is state-of-the-art, with guaranteed grammaticality of the system output due to the use of a constraint-based parser/generator. [PDF]

[Prado et al. 2003] - GistSumm: A Summarization Tool Based on a New Extractive Method

This paper presents a new extractive approach to automatic summarization based on the gist of the source text. The gist-based system, called GistSumm (GIST SUMMarizer), uses the gist as a guideline to identify and select text segments to include in the final extract. Automatically produced extracts have been evaluated in the light of gist preservation and textuality. [PDF]

[Lin 2003] - Improving Summarization Performance by Sentence Compression - A Pilot Study

In this paper we study the effectiveness of applying sentence compression on an extraction-based multi-document summarization system. Our results show that purely syntax-based compression does not improve system performance. Topic signature-based reranking of compressed sentences does not help much either. However, reranking using an oracle showed that a significant improvement remains possible. [PDF]

[Le & Horiguchi 2003] - A Sentence Reduction Using Syntax Control

This paper presents a method for sentence reduction in a foreign language, based on the behaviour of non-native speakers. We demonstrate an algorithm that uses semantic information to produce two reduced sentences in two different languages while preserving both the grammaticality and the meaning of the original sentence in the reduced sentences. In addition, the word order of the reduced sentences may differ from that of the original sentences. [PDF]


This paper proposes a new automatic speech summarization method having two stages: important sentence extraction and sentence compaction. Relatively important sentences are extracted from the results of large-vocabulary continuous speech recognition (LVCSR) based on the amount of information and the confidence measures of constituent words. The set of extracted sentences is compressed by our sentence compaction method. Sentence compaction is performed by selecting a word set that maximizes a summarization score which comprises the amount of information and the confidence measure of each word, the linguistic likelihood of word strings, and the word concatenation probability. The selected words are concatenated to create a summary. Effectiveness of the proposed method was confirmed by testing summarization of spontaneous presentations. Optimal ratio of sentence extraction to sentence compaction changes according to the target summarization ratio and features of presentations. [PDF]

[Carlos et al. 2003] - A Non-Linear Topic Detection Method for Text Summarization Using Wordnet

This paper deals with the problem of automatic topic detection in text documents. The proposed method follows a non-linear approach. The method uses a simple clustering algorithm to group the semantically-related sentences. The distance between two sentences is calculated based on the distance between all nouns that appear in the sentences. The distance between two nouns is calculated using the Wordnet thesaurus. An automatic text summarization system using a topic strength method was used to compare the results achieved by the Text Tiling Algorithm and the proposed method. The initial results show that the proposed method is a promising approach. [PDF]

[Knight & Marcu 2002] - Summarization beyond sentence extraction: A probabilistic approach to sentence compression

When humans produce summaries of documents, they do not simply extract sentences and concatenate them. Rather, they create new sentences that are grammatical, that cohere with one another, and that capture the most salient pieces of information in the original document. Given that large collections of text/abstract pairs are available online, it is now possible to envision algorithms that are trained to mimic this process. In this paper, we focus on sentence compression, a simpler version of this larger challenge. We aim to achieve two goals simultaneously: our compressions should be grammatical, and they should retain the most important pieces of information. These two goals can conflict. We devise both a noisy-channel and a decision-tree approach to the problem, and we evaluate results against manual compressions and a simple baseline. [PDF]
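
The noisy-channel approach mentioned in the abstract above is commonly written as the following decision rule. The symbols are chosen here for illustration: $t$ is the observed long sentence and $s$ a candidate compression, with a source model $P(s)$ favoring grammatical short strings and a channel model $P(t \mid s)$ scoring how plausibly $t$ arises by expanding $s$.

```latex
\hat{s} \;=\; \arg\max_{s} P(s \mid t)
        \;=\; \arg\max_{s} \frac{P(s)\, P(t \mid s)}{P(t)}
        \;=\; \arg\max_{s} P(s)\, P(t \mid s)
```

The denominator $P(t)$ can be dropped because it does not depend on $s$. The two factors correspond to the two goals named in the abstract: $P(s)$ rewards grammaticality and $P(t \mid s)$ rewards keeping the important content.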

[Zha 2002] - Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering

A novel method for simultaneous keyphrase extraction and generic text summarization is proposed by modeling text documents as weighted undirected and weighted bipartite graphs. Spectral graph clustering algorithms are used for partitioning sentences of the documents into topical groups with sentence link priors being exploited to enhance clustering quality. Within each topical group, saliency scores for keyphrases and sentences are generated based on a mutual reinforcement principle. The keyphrases and sentences are then ranked according to their saliency scores and selected for inclusion in the top keyphrase list and summaries of the document. The idea of building a hierarchy of summaries for documents capturing different levels of granularity is also briefly discussed. Our method is illustrated using several examples from news articles, news broadcast transcripts and web documents. [PDF]
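
The mutual reinforcement principle described in the abstract above can be sketched as an alternating update on a term-sentence bipartite graph: a term is salient if it occurs in salient sentences, and a sentence is salient if it contains salient terms. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it uses a plain weight matrix, a fixed iteration count, and unit-length normalization, and it skips the sentence clustering step. The names `normalize` and `mutual_reinforcement` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def mutual_reinforcement(weights, iterations=50):
    """weights[i][j] is the edge weight between term i and sentence j.

    Returns (term_scores, sentence_scores) after alternating updates.
    """
    n_terms, n_sents = len(weights), len(weights[0])
    term_scores = [1.0] * n_terms
    sent_scores = [1.0] * n_sents
    for _ in range(iterations):
        # A term's score is the weighted sum of its sentences' scores.
        term_scores = normalize([
            sum(weights[i][j] * sent_scores[j] for j in range(n_sents))
            for i in range(n_terms)
        ])
        # A sentence's score is the weighted sum of its terms' scores.
        sent_scores = normalize([
            sum(weights[i][j] * term_scores[i] for i in range(n_terms))
            for j in range(n_sents)
        ])
    return term_scores, sent_scores
```

Ranking terms and sentences by these scores gives a keyphrase list and an extractive summary from the same computation, which matches the simultaneous extraction the abstract describes.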

[Euler 2002] - Tailoring Text Using Topic Words: Selection and Compression

In the context of unified messaging, a textual message may have to be reduced in length for display on certain mobile devices. This paper presents a new method to extract sentences that deal with a certain topic from a given text. The approach is based on automatically computed lists of words that represent the desired topics. These word lists also give semantic hints on how to shorten sentences, extending previous methods that rely on syntactical clues only. The method has been evaluated for extraction accuracy and by human subjects for informativeness of the resulting extracts. [PDF]

[Barzilay & McKeown 2001] - Extracting Paraphrases from a Parallel Corpus

While paraphrasing is critical both for interpretation and generation of natural language, current systems use manual or semi-automatic methods to collect paraphrases. We present an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text. Our approach yields phrasal and single word lexical paraphrases as well as syntactic paraphrases. [PDF]

[Jing 2000] - Sentence Reduction for Automatic Text Summarization

We present a novel sentence reduction system for automatically removing extraneous phrases from sentences that are extracted from a document for summarization purposes. The system uses multiple sources of knowledge to decide which phrases in an extracted sentence can be removed, including syntactic knowledge, context information, and statistics computed from a corpus which consists of examples written by human professionals. Reduction can significantly improve the conciseness of automatic summaries. [PDF]

[Witbrock & Mittal 1999] - Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries

Using current extractive summarization techniques, it is impossible to produce a coherent document summary shorter than a single sentence, or to produce a summary that conforms to particular stylistic constraints. Ideally, one would prefer to understand the document, and to generate an appropriate summary directly from the results of that understanding. Absent a comprehensive natural language understanding system, an approximation must be used. This paper presents an alternative statistical model of a summarization process, which jointly applies statistical models of the term selection and term ordering process to produce brief coherent summaries in a style learned from a training corpus. [PDF]

[Chandrasekar et al. 1996] - Motivations and Methods for Text Simplification

Long and complicated sentences prove to be a stumbling block for current systems relying on NL input. These systems stand to gain from methods that syntactically simplify such sentences. To simplify a sentence, we need an idea of the structure of the sentence, to identify the components to be separated out. Obviously a parser could be used to obtain the complete structure of the sentence. However, full parsing is slow and prone to failure, especially on complex sentences. In this paper, we consider two alternatives to full parsing which could be used for simplification. The first approach uses a Finite State Grammar (FSG) to produce noun and verb groups while the second uses a Supertagging model to produce dependency linkages. We discuss the impact of these two input representations on the simplification process. [PDF]

(JPC) - Last update: 2005/03/19