CUTTING EDGE
Text Summarization




    Design Details...

A look at the modules and their algorithmic details

 

Module 1:      Accepting Input  

  •   The input is technical_article.txt 

  •   The text in the input file has to be processed before proceeding to segmentation.

  •   Processing has to be done in order to:

                          -  Identify paragraph breaks
                          -  Identify sentence breaks
                          -  Produce output with one word per line, so that it can be tagged by fnTBL-1.0

  •   This file is saved as processed.txt
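
A minimal sketch of this preprocessing step in Python. The regex-based splitting and the blank-line markers for sentence and paragraph breaks are assumptions for illustration; the project's actual rules may differ.

    # preprocess.py -- a sketch of Module 1 (splitting rules are assumed,
    # not the project's documented ones).
    import re

    def preprocess(in_path="technical_article.txt", out_path="processed.txt"):
        text = open(in_path).read()
        # Paragraph breaks: one or more blank lines in the source text.
        paragraphs = re.split(r"\n\s*\n", text.strip())
        with open(out_path, "w") as out:
            for para in paragraphs:
                # Sentence breaks: naive split on ., ! or ? followed by space.
                for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
                    for word in sent.split():
                        out.write(word + "\n")   # one word per line for fnTBL-1.0
                    out.write("\n")              # blank line = sentence break
                out.write("\n")                  # second blank line = paragraph break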

 

 Module 2:      fnTBL interface to Python 

  •    Accept contents of processed.txt  

  •  Use fnTBL-1.0 to tag it, writing the output to processed.tagged

  •  Use fnTBL-1.0 to chunk it, writing the output to processed.chunked

  •   Extract nouns and noun compounds

  •   Create a list of the noun tokens found in each paragraph
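
A sketch of the noun-extraction step. It assumes processed.tagged holds one "token TAG" pair per line with Penn Treebank tags, and that the blank-line break markers from Module 1 survive tagging; both points are assumptions, since the exact fnTBL-1.0 output format is not specified here.

    def nouns_per_paragraph(tagged_path="processed.tagged"):
        paragraphs, current, blank_run = [], [], 0
        for line in open(tagged_path):
            parts = line.split()
            if not parts:
                blank_run += 1
                # Two consecutive blank lines = paragraph break (see Module 1).
                if blank_run == 2 and current:
                    paragraphs.append(current)
                    current = []
                continue
            blank_run = 0
            if len(parts) < 2:
                continue                         # skip malformed lines
            token, tag = parts[0], parts[1]
            if tag.startswith("NN"):             # NN, NNS, NNP, NNPS
                current.append(token)
        if current:
            paragraphs.append(current)
        return paragraphs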

 

Module 3:      WordNet interface to Python 

3.1              Interface WordNet with Python

3.2              For each noun obtained from Module 2, construct a tuple element.

The synset offsets and sense numbers of the words returned by WordNet are needed.

These words are related to the given noun by the following relations:

          »    Synonyms (up to four levels of depth)

          »    Hypernyms (up to four levels of depth)

          »    Antonyms (up to four levels of depth)

 

3.3              DATA STRUCTURES:
    element = ['token', sense number, frequency]
    synset-offset = ["list of synonym synset offsets", "list of antonym synset offsets",
                     "list of hypernym synset offsets"]
    chain = [element1, element2, ...]
    seg-chains = [[chain1, wt1], [chain2, wt2], [chain3, wt3], ...]
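
A sketch of steps 3.1-3.3 using NLTK's WordNet reader. This is an assumption for illustration: the project predates NLTK's WordNet module and used a different Python-WordNet bridge, and the sense number below is a placeholder rather than the result of real sense selection.

    from nltk.corpus import wordnet as wn

    def build_element(token, frequency, sense_number=1):
        # element = ['token', sense number, frequency]
        return [token, sense_number, frequency]

    def synset_offsets(token, depth=4):
        syns, ants, hypers = [], [], []
        for sense in wn.synsets(token, pos=wn.NOUN):
            syns.append(sense.offset())
            for lemma in sense.lemmas():
                ants.extend(a.synset().offset() for a in lemma.antonyms())
            # Follow hypernym links up to `depth` levels (for brevity this
            # sketch applies the four-level limit to hypernyms only).
            frontier = [sense]
            for _ in range(depth):
                frontier = [h for s in frontier for h in s.hypernyms()]
                hypers.extend(h.offset() for h in frontier)
        # synset-offset = [synonym offsets, antonym offsets, hypernym offsets]
        return [syns, ants, hypers]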

3.4              Aggregate the tuples with similar synset offsets

3.5              Prune the weak chains

            swt = sum of all weights in seg-chains / length of seg-chains
            if seg-chains[i][1] < swt then remove seg-chains[i]
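
In Python, with the seg-chains structure from 3.3:

    def prune_weak_chains(seg_chains):
        # seg_chains = [[chain1, wt1], [chain2, wt2], ...]
        if not seg_chains:
            return []
        swt = sum(wt for _, wt in seg_chains) / len(seg_chains)
        # Keep only the chains whose weight reaches the segment average.
        return [[chain, wt] for chain, wt in seg_chains if wt >= swt]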

3.6              Intersegment linking

            For each word in a chain, match against the other seg-chains, i.e. look
            for the same token with the same sense number:
            if found, then merge the two chains
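
A sketch of the merge, again over the 3.3 structures. Summing the weights of merged chains is an assumption; the notes do not say how merged weights combine.

    def link_segments(seg_chains):
        merged = []
        for chain, wt in seg_chains:
            keys = {(tok, sense) for tok, sense, _ in chain}
            for entry in merged:
                other_keys = {(tok, sense) for tok, sense, _ in entry[0]}
                if keys & other_keys:          # same token and same sense number
                    entry[0].extend(chain)     # merge the two chains
                    entry[1] += wt             # combined weight (assumed: summed)
                    break
            else:
                merged.append([chain, wt])
        return merged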

3.7              Display lexical chains

 

Module 4:      Text Extraction 

4.1              Identify Strong Chains

4.1.1        Calculate Length and Homogeneity Index

            length = frequency from table
            HI = 1 - [number of distinct words in the chain / total words in the chain]

4.1.2        Calculate Score

            score = length * HI
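
One reading of these formulas in Python (taking "frequency from table" to mean the chain's total word frequency, which is an assumption):

    def chain_score(chain):
        # chain = [['token', sense number, frequency], ...]
        length = sum(freq for _, _, freq in chain)
        distinct = len({tok for tok, _, _ in chain})
        hi = 1 - distinct / length if length else 0   # Homogeneity Index
        return length * hi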

4.2              Identify representative words

            avg = average word frequency in the chain
            if frequency of word >= avg then the word is a representative word
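
In Python, over a single chain:

    def representative_words(chain):
        # avg word frequency over the chain's elements
        avg = sum(freq for _, _, freq in chain) / len(chain)
        return [tok for tok, _, freq in chain if freq >= avg]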

4.3              Display strong chains

4.4              Extract sentences

            search the sentence list of the complete document for the presence of
            representative words
            calculate the sentence number
            display that sentence from the sentence list
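
A sketch of the extraction step. Here `sentences` stands for the document's sentence list recovered in Module 1 (the name is assumed for illustration); sentence numbers fall out of the enumeration.

    def extract_sentences(sentences, rep_words):
        rep = {w.lower() for w in rep_words}
        summary = []
        for number, sent in enumerate(sentences):
            words = {w.strip(".,;:!?").lower() for w in sent.split()}
            if rep & words:                    # a representative word occurs here
                summary.append((number, sent))
        return summary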