CUTTING EDGE
Text Summarization




    Design Details...

A look at the modules and their algorithmic details

 

Module 1:      Accepting Input  

  •   The input is technical_article.txt 

  •   The text in the input file has to be processed before proceeding to segmentation.

  •   Processing has to be done in order to:

                          -  Identify paragraph breaks
                          -  Identify sentence breaks
                          -  Produce output with one word per line, so that it can be tagged by fnTBL-1.0

  •   This file is saved as processed.txt
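
A minimal sketch of this preprocessing step in Python. The regex-based splitting and the blank-line markers for sentence and paragraph breaks are assumptions for illustration; the project's actual rules may differ.

    # preprocess.py -- a sketch of Module 1 (splitting rules are assumed,
    # not the project's documented ones).
    import re

    def preprocess(in_path="technical_article.txt", out_path="processed.txt"):
        text = open(in_path).read()
        # Paragraph breaks: one or more blank lines in the source text.
        paragraphs = re.split(r"\n\s*\n", text.strip())
        with open(out_path, "w") as out:
            for para in paragraphs:
                # Sentence breaks: naive split on ., ! or ? followed by space.
                for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
                    for word in sent.split():
                        out.write(word + "\n")   # one word per line for fnTBL-1.0
                    out.write("\n")              # blank line = sentence break
                out.write("\n")                  # second blank line = paragraph break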

 

 Module 2:      fnTBL interface to Python 

  •    Accept contents of processed.txt  

  •  Use fnTBL-1.0 to tag it, writing the output to processed.tagged

  •  Use fnTBL-1.0 to chunk it, writing the output to processed.chunked

  •   Extract nouns and noun compounds

  •   Create a list of the noun tokens found in each paragraph
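
A sketch of the noun-extraction step. It assumes processed.tagged holds one "token TAG" pair per line with Penn Treebank tags, and that the blank-line break markers from Module 1 survive tagging; both points are assumptions, since the exact fnTBL-1.0 output format is not specified here.

    def nouns_per_paragraph(tagged_path="processed.tagged"):
        paragraphs, current, blank_run = [], [], 0
        for line in open(tagged_path):
            parts = line.split()
            if not parts:
                blank_run += 1
                # Two consecutive blank lines = paragraph break (see Module 1).
                if blank_run == 2 and current:
                    paragraphs.append(current)
                    current = []
                continue
            blank_run = 0
            if len(parts) < 2:
                continue                         # skip malformed lines
            token, tag = parts[0], parts[1]
            if tag.startswith("NN"):             # NN, NNS, NNP, NNPS
                current.append(token)
        if current:
            paragraphs.append(current)
        return paragraphs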

 

Module 3:      WordNet interface to Python 

3.1              Interface WordNet with Python

3.2              For each noun obtained from Module 2, construct a tuple element.

The synset offsets and sense numbers of the words returned by WordNet are needed.

These words are related to the given noun by the following relations:

          »    Synonyms (up to four levels of depth)

          »    Hypernyms (up to four levels of depth)

          »    Antonyms (up to four levels of depth)

 

3.3              DATA STRUCTURES:
    element = ['token', sense number, frequency]
    synset-offset = ["list of synonym synset offsets", "list of antonym synset offsets",
                     "list of hypernym synset offsets"]
    chain = [element1, element2, ...]
    seg-chains = [[chain1, wt1], [chain2, wt2], [chain3, wt3], ...]
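
A sketch of steps 3.1-3.3 using NLTK's WordNet reader. This is an assumption for illustration: the project predates NLTK's WordNet module and used a different Python-WordNet bridge, and the sense number below is a placeholder rather than the result of real sense selection.

    from nltk.corpus import wordnet as wn

    def build_element(token, frequency, sense_number=1):
        # element = ['token', sense number, frequency]
        return [token, sense_number, frequency]

    def synset_offsets(token, depth=4):
        syns, ants, hypers = [], [], []
        for sense in wn.synsets(token, pos=wn.NOUN):
            syns.append(sense.offset())
            for lemma in sense.lemmas():
                ants.extend(a.synset().offset() for a in lemma.antonyms())
            # Follow hypernym links up to `depth` levels (for brevity this
            # sketch applies the four-level limit to hypernyms only).
            frontier = [sense]
            for _ in range(depth):
                frontier = [h for s in frontier for h in s.hypernyms()]
                hypers.extend(h.offset() for h in frontier)
        # synset-offset = [synonym offsets, antonym offsets, hypernym offsets]
        return [syns, ants, hypers]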

3.4              Aggregate the tuples with similar synset offsets

3.5              Prune the weak chains

            swt = sum of all weights in seg-chains / length of seg-chains
            if seg-chains[i][1] < swt then remove seg-chains[i]
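
In Python, with the seg-chains structure from 3.3:

    def prune_weak_chains(seg_chains):
        # seg_chains = [[chain1, wt1], [chain2, wt2], ...]
        if not seg_chains:
            return []
        swt = sum(wt for _, wt in seg_chains) / len(seg_chains)
        # Keep only the chains whose weight reaches the segment average.
        return [[chain, wt] for chain, wt in seg_chains if wt >= swt]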

3.6              Intersegment linking

            For each word in a chain, match against the other seg-chains, i.e. look
            for the same token with the same sense number:
            if found, then merge the two chains
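
A sketch of the merge, again over the 3.3 structures. Summing the weights of merged chains is an assumption; the notes do not say how merged weights combine.

    def link_segments(seg_chains):
        merged = []
        for chain, wt in seg_chains:
            keys = {(tok, sense) for tok, sense, _ in chain}
            for entry in merged:
                other_keys = {(tok, sense) for tok, sense, _ in entry[0]}
                if keys & other_keys:          # same token and same sense number
                    entry[0].extend(chain)     # merge the two chains
                    entry[1] += wt             # combined weight (assumed: summed)
                    break
            else:
                merged.append([chain, wt])
        return merged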

3.7              Display lexical chains

 

Module 4:      Text Extraction 

4.1              Identify Strong Chains

4.1.1        Calculate Length and Homogeneity Index

            length = frequency from table
            HI = 1 - [number of distinct words in the chain / total words in the chain]

4.1.2        Calculate Score

            score = length * HI
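
One reading of these formulas in Python (taking "frequency from table" to mean the chain's total word frequency, which is an assumption):

    def chain_score(chain):
        # chain = [['token', sense number, frequency], ...]
        length = sum(freq for _, _, freq in chain)
        distinct = len({tok for tok, _, _ in chain})
        hi = 1 - distinct / length if length else 0   # Homogeneity Index
        return length * hi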

4.2              Identify representative words

            avg = average word frequency in the chain
            if frequency of word >= avg then the word is a representative word
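
In Python, over a single chain:

    def representative_words(chain):
        # avg word frequency over the chain's elements
        avg = sum(freq for _, _, freq in chain) / len(chain)
        return [tok for tok, _, freq in chain if freq >= avg]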

4.3              Display strong chains

4.4              Extract sentences

            search the sentence list of the complete document for the presence of
            representative words
            calculate the sentence number
            display that sentence from the sentence list
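
A sketch of the extraction step. Here `sentences` stands for the document's sentence list recovered in Module 1 (the name is assumed for illustration); sentence numbers fall out of the enumeration.

    def extract_sentences(sentences, rep_words):
        rep = {w.lower() for w in rep_words}
        summary = []
        for number, sent in enumerate(sentences):
            words = {w.strip(".,;:!?").lower() for w in sent.split()}
            if rep & words:                    # a representative word occurs here
                summary.append((number, sent))
        return summary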