A place to manage, share and publish items from the field of Automatic Text Summarization / Sentence Compression.

When CERN researcher Tim Berners-Lee first conceived of a hypermedia system, in the early 1990s, he had in mind a tool through which researchers could publish and share their work with the community. This vision drove the creation of the World Wide Web, which has indeed revolutionized how people exchange information.

The purpose of this page is likewise to organize and share our research work, serving as a communication point between ourselves and the community. We believe that communication and sharing are vital in scientific research. This page therefore contains a set of links and papers in our research area, including our own publications, experimental data, research tools, and dates of relevant events, among other things.

About Myself


Text Summarization, Sentence Compression, Paraphrasing, Ontology, Web Search, Text Mining, Information Retrieval, Information Extraction and Machine Learning.

Short CV

João Cordeiro graduated in Applied Mathematics from the University of Beira Interior in 1998. He obtained a Master's degree in Artificial Intelligence and Computer Science, in the specific field of Text Mining and Extraction, in 2003, from the Faculty of Science of the University of Porto (Portugal). Since January 2005 he has been a PhD student in Computer Science, in the area of Natural Language Processing, at the University of Beira Interior (Portugal).

He is an Assistant Lecturer in the Department of Computer Science of the University of Beira Interior, where he teaches subjects related to Programming, Algorithms and Artificial Intelligence.

He is a researcher at the Centre of Human Language Technology and Bioinformatics (HULTIG) of the University of Beira Interior, and also a researcher at NIAAD in Porto (Portugal).

He is a member of the ACL and a member of the Portuguese Association of Artificial Intelligence (APPIA).


The collection of literature considered relevant, divided into three main categories. An abstract preview is presented for each paper, and a link gives access to the PDF document. This is my electronic literature archive; it was created, and is maintained, solely for self-organization. Every PDF was downloaded from publicly available web sites, in most cases directly from the authors' web pages.

More Related
The bibliographic material most closely related to our work.
More General
More general background literature.
A collection of tutorial items.


ACL: 2004 2005 2006 2007
ECML/PKDD: 2004 2005 2006 2007
RANLP: 2003 2005 2007

Data Sets

Rule Applicability - New Version (2011)

Several results obtained with different versions of the reduction rule sets. These were produced by varying the learning options, for example whether or not shallow parsing was used in the rule induction process. In pursuit of a refined rule set with higher accuracy, we noticed that shallow parsing features tend to give rise to a significant number of "too general" reduction rules, which depend almost exclusively on those features. The next file shows the "rule applicability" for such a case: [RuleAppBase.html]. By ignoring shallow parsing we obtain a much more specific rule set, which seems better tailored to produce fewer reduction errors. The next file shows the rules that fire most often from this chunkless rule set: [RuleAppChunkLess.html]. In both files, rules are ordered by firing frequency.

The following files contain samples of the results obtained by applying a chunkless rule set to a collection of web news stories. Each file corresponds to a different rule application strategy, e.g. applying only the best rule, or using bootstrapping. The first three files are samples with 100 reduction cases each.

The complete set of chunkless rules, used on generating the reductions shown in the previous list, is available here: [R04.txt].

Rule Application Sample Files (2010)

A set of files showing the application of sentence compression rules to randomly selected news sentences. These files are labeled according to specific experiments, so each file name encodes a particular experiment, which is referred to in the literature.

| T0 | TA | TB | TC | TD | TF | TG [The whole set here 495KB]

Bubble Extraction Data (2009)

We supply here the bubble sets used to induce the sentence reduction rules reported in our [Cordeiro et al. 2009] publication. First, one may download a zip file containing a text file that shows the bubbles extracted from a set of aligned paraphrases. All paraphrases are listed, even those from which no bubbles were extracted. This shows the kind of bubbles we selected for our experiments: exactly those satisfying the conditions mentioned in the article. The file is downloadable here: [bubs_a.zip 4.3 MB].

All three bubble sets employed in the sentence reduction rule induction may be downloaded from the next link, a zip file with three Prolog data files, one for each bubble size type: 1, 2, and 3. In these files a bubble is represented as a Prolog term (bub/5), complying with the description made in [Cordeiro et al. 2009]. Download it here: [bubs123.zip 2.5 MB].

Paraphrase Alignment Data with POS Chunking (2008)

A collection of paraphrases, dynamically aligned, with colored chunking marks on each sentence. Sentences were chunked with two shallow parsers, "MontyLingua" and "OpenNLP": [alg-chunks-chrom.html 1.4MB]

Paraphrase Alignment Data (2007)

Some alignment samples, generated with the optimal "Needleman-Wunsch" and "Smith-Waterman" algorithms, with dynamic algorithm selection, as described in [8].

[DUC2002 sample]
[WNS small sample] [WNS complete sample 6MB]
An XML file with 93572 paraphrase sentence pairs, dynamically aligned using the Smith-Waterman (SW) and Needleman-Wunsch (NW) algorithms [here 6MB]
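To make the alignment method concrete, here is a minimal sketch of the classic Needleman-Wunsch algorithm applied at the word level. The scoring values (match, mismatch, gap) are illustrative defaults, not the parameters used in [8]:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global word-level alignment of two token sequences (classic NW)."""
    n, m = len(a), len(b)
    # score[i][j] = best alignment score of a[:i] versus b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback to recover the aligned pair; gaps are shown as '-'
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return list(reversed(out_a)), list(reversed(out_b))
```

Smith-Waterman differs mainly in clamping cell scores at zero and tracing back from the maximum-scoring cell, yielding a local rather than global alignment; "dynamic algorithm selection" means choosing between the two per sentence pair.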

Paraphrase Clustering (2007)

Three samples of automatically generated paraphrase clusters are available here. By paraphrase cluster we mean a set of sentences in which each sentence pair constitutes a paraphrase pair. We admit both symmetrically and asymmetrically entailed pairs as paraphrases; in an asymmetric pair, one sentence entails the other but not vice versa. These samples were generated by different clustering algorithms, and the examples are randomly mixed. The data sets were created for human evaluation: each cluster should be classified as "correct" or "incorrect" by each human judge, and a negative label should be given to a cluster containing at least two sentences without any entailment relationship between them. The cluster sentences were drawn from a collection of "News Stories" automatically extracted from the Web.

[sample a] [sample b] [sample c]
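The correctness criterion above can be sketched as a small pairwise check. The `entails` predicate below is a hypothetical stand-in for whatever entailment judgment (human or automatic) is applied:

```python
from itertools import combinations

def cluster_is_correct(sentences, entails):
    """A paraphrase cluster is correct iff every sentence pair is linked
    by entailment in at least one direction (symmetric or asymmetric);
    `entails(s, t)` is a caller-supplied predicate."""
    return all(entails(s, t) or entails(t, s)
               for s, t in combinations(sentences, 2))
```

A single pair of mutually non-entailing sentences is thus enough to make the whole cluster incorrect, matching the negative-label rule described above.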

Paraphrase Identification/Extraction (2006)

The following data sets relate to the paraphrase identification/extraction experiments reported in some of our articles. The material supplied here enables anyone in the community to reconstruct the original data sets used in our experiments. Some data sets are incomplete because they contain subparts, provided by other authors or organizations, which are not publicly available. Those subparts should be requested directly from the original sources.

  1. The {MSRPC ∪ X1999} Corpus
  2. The {KMC ∪ X1087} Corpus
  3. The {MSRPC(+) ∪ KMC ∪ X4987} Corpus
  4. The {MSRPC(+) ∪ KMC ∪ X4987} Corpus, without "quasi-equal" negative pairs.

In the previous list, KMC stands for the "Knight and Marcu Corpus", a collection of 1087 asymmetrical paraphrases used by those authors in their "Sentence Compression" research work [pdf]. This corpus should be obtained directly from the authors. Each "X<NUMBER>" subpart contains a set of NUMBER negative paraphrase pairs, randomly selected from related web news stories. These negative subparts were added for the sake of corpus balancing. The MSRPC(+) subpart designates the subset of positive paraphrase pairs from the "Microsoft Research Paraphrase Corpus" (MSRPC).

Each line of each file contains 5 elements separated by the TAB character ('\t'). The first three elements are integer values: the first indicates the pair type (1 means a positive pair, 0 a negative pair), and the other two are sentence indexes in the original text. The remaining two elements are the sentences that constitute the paraphrase pair. By "positive" and "negative" we mean a true or false paraphrase pair.
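Assuming exactly the layout just described, a minimal Python sketch for reading such a corpus file might look as follows (the function name and dictionary keys are illustrative, not part of the released data):

```python
def read_paraphrase_pairs(path):
    """Parse a corpus file with 5 TAB-separated fields per line:
    pair type (1 = positive, 0 = negative), two sentence indexes,
    then the two sentences of the paraphrase pair."""
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 5:
                continue  # skip malformed lines
            label, idx1, idx2, sent1, sent2 = fields
            pairs.append({
                "positive": label == "1",
                "indexes": (int(idx1), int(idx2)),
                "sentences": (sent1, sent2),
            })
    return pairs
```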


[12] - Cordeiro, J.P., Dias, G., and Brazdil, P. (2013). Rule Induction for Sentence Reduction. Progress in Artificial Intelligence, 16th Portuguese Conference on Artificial Intelligence, EPIA 2013, LNAI 8154, pp. 528-539. Springer Verlag.

[11] - Dias, G., Moraliyski, R., Cordeiro, J.P., Doucet, A., and Ahonen-Myka, H. (2010). Automatic Discovery of Word Semantic Relations using Paraphrase Alignment and Distributional Lexical Semantics Analysis. In Journal of Natural Language Engineering, Special Issue on Distributional Lexical Semantics, guest eds. Roberto Basili and Marco Pennacchiotti. Volume 16, issue 04, pp. 439-467. Cambridge University Press. ISSN 1351-3249.

[10] - Grigonytė, G., Cordeiro, J.P., Moraliyski, R., Dias, G., and Brazdil, P. (2010). A Paraphrase Alignment for Synonym Evidence Discovery. 23rd International Conference on Computational Linguistics (COLING 2010). Beijing, China, August 23-27.

[9] - Cordeiro, J.P., Dias, G., and Brazdil, P. (2009). Unsupervised Induction of Sentence Compression Rules. In Proceedings of the Workshop on Language Generation and Summarization associated with the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/IJCNLP 2009). Singapore, August 6.

[8] - Cordeiro, J.P., Dias, G., Cleuziou, G., and Brazdil, P. (2007). New Functions for Unsupervised Asymmetrical Paraphrase Detection. In Journal of Software. Volume 2, issue 4, pp. 12-23. Academy Publisher. Finland. ISSN: 1796-217X. October 2007.

[7] - Cordeiro, J.P., Dias, G., and Cleuziou, G. (2007). Biology Based Alignments of Paraphrases for Sentence Compression. In Proceedings of the Workshop on Textual Entailment and Paraphrasing (ACL-PASCAL / ACL 2007). Prague, Czech Republic.

[6] - Cordeiro, J.P., Dias, G., and Brazdil, P. (2007). Learning Paraphrases from WNS Corpora. 20th International FLAIRS Conference. AAAI Press. Key West, Florida, USA.

[5] - Cordeiro, J.P., Dias, G., and Brazdil, P. (2007). A Metric for Paraphrase Detection. 2nd International Multi-Conference on Computing in the Global Information Technology. IEEE Computer Society Press. Guadeloupe, France.

[4] - Dias, G., Nunes, C., Cordeiro, J.P., Moraliyski, R., Marcelino, I., Mukelov, R., Campos, R., Santos, C., Alves, E., Conde, B., and Nonchev, B. (2006). Language Independent Methodologies to Tackle Multilinguality. In Readings in Multilinguality, Selected Papers from Young Researchers in BIS-21++. Galia Angelova, Kiril Simov, Milena Slavcheva (editors). Incoma Ltd. Shoumen, Bulgaria.

[3] - Alexandre, L., Pereira, M., Madeira, C.S., Cordeiro, J.P., and Dias, G. (2004). Web Image Indexing: Combining Image Analysis with Text Processing. In Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services, Instituto Superior Técnico, Lisbon, Portugal. April 21-23. CD version. ISBN: 9729811571.

[2] - Cordeiro, J.P., and Brazdil, P. (2004). Learning Text Extraction Rules Without Ignoring Stop Words. 4th International Workshop on Pattern Recognition in Information Systems (PRIS 2004). Porto, Portugal.

[1] - Cordeiro, J.P. (2003). Extracção de elementos relevantes em texto/páginas da World Wide Web (Extraction of Relevant Elements from Text/Pages of the World Wide Web). Master's thesis in Artificial Intelligence and Computation, University of Porto. 174 p. http://purl.pt/6320 (BND 1290590).