Today I want to present a simple library and command-line utility for summarizing HTML pages or plain text. The free and open source Python project „Sumy“ ( [Link: github.com] ) can extract the text content of HTML pages and shorten it into a summary. It currently provides a framework for text summarization that implements these methods:
- Luhn – heuristic method
- Edmundson – heuristic method building on previous statistical research
- Latent Semantic Analysis (LSA) – one of the implemented algorithms
- LexRank – unsupervised approach inspired by the PageRank and HITS algorithms
- TextRank – some sort of combination of a few sources, probably Wikipedia and some papers from the first page of Google 🙂
- SumBasic – method that is often used as a baseline in the literature
- KL-Sum – method that greedily adds sentences to a summary as long as doing so decreases the KL divergence between the summary and the source document
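To make the last item a bit more concrete, here is a minimal sketch of the greedy KL-Sum idea in plain Python. It is not Sumy's implementation and does not use Sumy's API: the function names (`tokenize`, `split_sentences`, `kl_divergence`, `kl_sum`), the naive regex-based sentence splitter, and the smoothing constant are all assumptions made for illustration. An empty summary is treated as infinitely far from the document, so the first sentence is always accepted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not what Sumy does)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def kl_divergence(doc_freq, summary_counts, smoothing=0.01):
    """KL(document || summary), with additive smoothing on the summary side."""
    total = sum(summary_counts.values()) + smoothing * len(doc_freq)
    kl = 0.0
    for word, p in doc_freq.items():
        q = (summary_counts.get(word, 0) + smoothing) / total
        kl += p * math.log(p / q)
    return kl


def kl_sum(text, max_sentences=2):
    """Greedily add the sentence that lowers the KL divergence the most."""
    sentences = split_sentences(text)
    doc_counts = Counter(tokenize(text))
    n_words = sum(doc_counts.values())
    doc_freq = {w: c / n_words for w, c in doc_counts.items()}

    chosen = []                 # indices of selected sentences
    summary_counts = Counter()  # word counts of the summary so far
    current_kl = math.inf       # empty summary: assumed infinitely bad

    while len(chosen) < max_sentences:
        best = None
        for i, sentence in enumerate(sentences):
            if i in chosen:
                continue
            candidate = summary_counts + Counter(tokenize(sentence))
            kl = kl_divergence(doc_freq, candidate)
            if best is None or kl < best[0]:
                best = (kl, i, candidate)
        # Stop when no sentence is left or nothing decreases the divergence.
        if best is None or best[0] >= current_kl:
            break
        current_kl, index, summary_counts = best
        chosen.append(index)

    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `kl_sum(article_text, max_sentences=3)` returns at most three sentences from the input, in document order. A real summarizer would add proper sentence tokenization, stop-word removal and stemming, which is what the library handles for you.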
With natural language processing and the Python NLTK module, we’re ready to try out automatic text classification, text summarization and HTML content extraction in the Python programming language ( [Link: github.com] ).
Photo: Richard Jones // Flickr.com // Attribution 2.0 Generic (CC BY 2.0)
Tags: Automatic text summarizer, TextRank, Latent Semantic Analysis, LSA, heuristic method with previous statistical research, inspired by algorithms PageRank and HITS