SUMMA

 

Overview

SUMMA is a toolkit for the development of text summarization systems.

SUMMA Simple Multi-document Summarizer

Functionality

This component generates a summary of a set of documents. It combines sentence features into a score, ranks the sentences by score, and annotates the top-ranked sentences. It also generates a stand-alone summary in the user interface. The score should include, for example, a multi-document feature such as the similarity of the sentence to the centroid of the document. The component also annotates, in each individual document, the sentences that contribute to the multi-document summary.
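The score-and-rank step described above can be sketched as follows. This is only an illustration, not the component's code: it assumes the features are combined as a weighted sum, and the feature names (position, centroid_sim), the weights and the function score_sentences are hypothetical.

```python
def score_sentences(sentences, features, weights):
    """Return (score, index) pairs sorted from highest to lowest score.

    Assumption: the score is a weighted sum of the named features.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    scored = []
    for index, sentence in enumerate(sentences):
        # A feature that is missing on a sentence counts as 0.0.
        score = sum(w * sentence.get(name, 0.0)
                    for name, w in zip(features, weights))
        scored.append((score, index))
    # Highest score first; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Hypothetical feature values for three sentences.
sentences = [
    {"position": 1.0, "centroid_sim": 0.2},
    {"position": 0.5, "centroid_sim": 0.9},
    {"position": 0.0, "centroid_sim": 0.4},
]
ranked = score_sentences(sentences, ["position", "centroid_sim"], [0.3, 0.7])
top = [index for _, index in ranked[:2]]  # indices of the two best sentences
```

With these made-up values the second sentence ranks first, because the centroid similarity carries the larger weight.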

Parameters of the Resource

  • annSet: the annotation set where the sentence annotations and other relevant annotations live
  • compression: an integer value representing either a percentage of sentences to extract or an absolute number of words to extract from the document
  • corpus: the corpus with the documents to summarize
  • newDocument: a boolean indicating whether a new document should be generated for the summary
  • removeRedumdancy: a boolean indicating whether redundancy should be removed using cosine similarity
  • sentAnn: the name of the sentence annotation where the features for scoring are found and where the final score is stored
  • sentCompression: a boolean indicating whether compression is sentence-based (i.e. a proportion of sentences to extract) or word-based (an absolute number of words)
  • sumFeatures: the features to include in the computation of the score
  • sumWeigths: the weights used to combine the features
  • tokenAnn: the token annotation used to count the number of words when computing compression
  • sumSetName: the name of the annotation under which the sentences selected for the summary are annotated
  • thresholdSim: a double indicating the similarity threshold above which a sentence is excluded from the summary
  • vectorName: the name of the vector used to compute similarities
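How compression, sentCompression, removeRedumdancy and thresholdSim might interact is sketched below. The details are assumptions rather than documented behaviour: the percentage is rounded down with a minimum of one sentence, selection stops at the first sentence that would exceed the budget, and a candidate is dropped when its cosine similarity to an already selected sentence reaches the threshold. The dictionary keys id, tokens and vector and the functions cosine and select are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select(ranked, compression, sent_compression, remove_redundancy, threshold_sim):
    """Pick sentences from a best-first list until the budget is used up."""
    if sent_compression:
        # compression is read as a percentage of the sentences (assumed rounding).
        budget = max(1, len(ranked) * compression // 100)
    else:
        # compression is read as an absolute number of words.
        budget = compression
    chosen, used = [], 0
    for sent in ranked:
        # Skip a candidate that is too similar to a sentence already chosen.
        if remove_redundancy and any(
                cosine(sent["vector"], c["vector"]) >= threshold_sim for c in chosen):
            continue
        cost = 1 if sent_compression else sent["tokens"]
        if used + cost > budget:
            break
        chosen.append(sent)
        used += cost
    return chosen


# Hypothetical sentences, already sorted from best to worst score.
ranked_sentences = [
    {"id": "s1", "tokens": 5, "vector": {"summary": 1.0, "text": 1.0}},
    {"id": "s2", "tokens": 6, "vector": {"summary": 1.0, "text": 1.0}},
    {"id": "s3", "tokens": 4, "vector": {"gate": 1.0}},
    {"id": "s4", "tokens": 7, "vector": {"corpus": 1.0}},
]
by_percent = [s["id"] for s in select(ranked_sentences, 50, True, True, 0.8)]
by_words = [s["id"] for s in select(ranked_sentences, 10, False, False, 0.8)]
```

In the sentence-based call, s2 is skipped because its vector is identical to that of s1, so the 50% budget is filled by s1 and s3. In the word-based call, only s1 fits into the ten-word budget.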

Restriction

This resource should be used in a GATE pipeline; it does not make sense to use it in a Corpus Pipeline! Features, including some multi-document feature, should have been computed for each sentence in every document. The documents should also contain vectors for sentences so that redundancy removal can be performed. The component assumes that the order of documents in the corpus is the order in which sentences should be presented (sentences from the first document first, sentences from the last document last).
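The ordering assumption can be illustrated with a short sketch. The record layout (doc_index, start, text) and the function names are hypothetical. Ordering sentences within one document by their start offset is an additional assumption; the text above only fixes the order between documents.

```python
from collections import defaultdict


def order_for_presentation(selected):
    """Sort by the document's position in the corpus, then by start offset."""
    return sorted(selected, key=lambda s: (s["doc_index"], s["start"]))


def group_by_document(selected):
    """Map each document index to the texts of its summary sentences."""
    by_doc = defaultdict(list)
    for sentence in order_for_presentation(selected):
        by_doc[sentence["doc_index"]].append(sentence["text"])
    return dict(by_doc)


# Hypothetical selected sentences, listed in score order.
selected = [
    {"doc_index": 2, "start": 0, "text": "C"},
    {"doc_index": 0, "start": 40, "text": "B"},
    {"doc_index": 0, "start": 3, "text": "A"},
]
summary = " ".join(s["text"] for s in order_for_presentation(selected))
per_document = group_by_document(selected)
```

The grouping mirrors the per-document annotation of contributing sentences: each document index maps to the sentences it contributes to the multi-document summary.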

Copyright 2002-2014 Universitat Pompeu Fabra