Text summarization (TS) and information extraction (IE) are key information access technologies whose adaptation depends on the availability of language resources such as domain corpora.

We have created the CONCISUS Corpus for the study of multilingual and cross-lingual information extraction in Spanish and English.

The CONCISUS Corpus is an annotated dataset of comparable Spanish and English event summaries in four application domains. Event summaries are an important text type to investigate for the following reasons:

First, event summaries are readily available on the Web and in newspaper collections;

Second, event summaries such as those we study here are concise, which makes them of interest for automatic text generation applications such as non-extractive summarization;

Lastly, the summaries we have collected contain the essential information about the reported events, which makes them valuable for manual or automatic domain modeling.

The CONCISUS Corpus currently covers the following domains: aviation accidents, train accidents, earthquakes, and terrorist attacks.

The dataset contains: comparable summaries, comparable automatic translations, and comparable full documents.
An example of comparable summaries in the aviation accident domain is shown below:

2008 January 17 - British Airways Flight 38, a Boeing 777-200ER, lands short of the runway at London Heathrow Airport in the United Kingdom. Nine of the 152 people on board are treated for minor injuries, but there are no fatalities; this is the first loss of a Boeing 777.
2008 17 de enero: el Vuelo 38 de British Airways (Boeing 777) sufrió un accidente al tomar tierra en el Aeropuerto de Londres-Heathrow  procedente de Pekín. No hubo víctimas mortales.

An example of comparable summaries in the terrorist attack domain is shown below:

Monday, February 19, 2007. Around midnight on Sunday, a pair of bombs exploded on the Samjhauta Express (Friendship Express), a night train going from Delhi, India to Lahore, Pakistan. At least 68 fatalities have been reported. Two more suitcases with improvised explosive devices have been found on the train.  Some 13 passengers were reported injured, some with severe burns.
19 de febrero de 2007: Fallecen 66 personas y más de 60 resultan heridas a consecuencia de la explosión de dos bombas en un tren que enlaza la India con Pakistán.
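
To make the notion of a comparable summary pair concrete, the following is a minimal Python sketch of how one such pair might be held in memory and grouped by domain. It is an illustration only: the class name `ComparableSummaryPair`, the helper `group_by_domain`, and the domain identifiers are hypothetical and do not describe the actual file format or annotation scheme of the corpus. The example texts are shortened from the aviation accident pair shown above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four domains named in the corpus description.
# The identifier spellings are assumptions made for this sketch.
DOMAINS = ("aviation_accident", "train_accident", "earthquake", "terrorist_attack")


@dataclass(frozen=True)
class ComparableSummaryPair:
    """Hypothetical container for one English/Spanish pair of event summaries."""
    domain: str
    english: str
    spanish: str

    def __post_init__(self):
        # Reject domains that the corpus description does not list.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


def group_by_domain(pairs):
    """Return a dict mapping each domain to the list of pairs in that domain."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.domain].append(pair)
    return dict(grouped)


# Shortened versions of the aviation accident example quoted above.
pair = ComparableSummaryPair(
    domain="aviation_accident",
    english="2008 January 17 - British Airways Flight 38, a Boeing 777-200ER, "
            "lands short of the runway at London Heathrow Airport.",
    spanish="2008 17 de enero: el Vuelo 38 de British Airways (Boeing 777) "
            "sufrió un accidente al tomar tierra en el Aeropuerto de Londres-Heathrow.",
)
by_domain = group_by_domain([pair])
```

Note that the two summaries in a pair are comparable rather than parallel: as the examples show, they describe the same event but may differ in the details they report, so a representation of this kind should not assume sentence-level alignment.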




