Extractive Text Summarization using TextRank Algorithm (Basic Logic)

Overview

Introduction

Analyzing a text and drawing a conclusion from it can feel like a hassle to many people. A good reader gets there by focusing on the parts they consider essential to understanding the text. If you are not that kind of reader, then when you are asked to summarize a text or find its main idea, you may simply not want to do it, or not be able to. Whether the reason is lack of time or mental fatigue, the reluctance is understandable. This is where NLP and its text summarization techniques can help. With these techniques, you can often find a suitable way to summarize articles, news, and many other kinds of text documents, or to extract the main idea behind them.

Text summarization is both a common need and a problem that calls for NLP techniques. Although there are many other models for summarizing text, this article covers TextRank, together with the PageRank algorithm it builds on, which is a more elementary method than the newer ones.

If you are new to text summarization algorithms, TextRank is a good place to start, because it shows how summarization was approached in its early days. If you are looking for more advanced techniques such as Transformers, you should check other articles dedicated to those methods, as they are more complex than TextRank.

TextRank implementations are usually considered lightweight and require a limited amount of memory compared to Transformer models like BERT, which usually demand more resources from the machine that runs them.

Text Summarization Methods

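Text summarization methods fall into two categories: extractive summarization, which selects existing sentences from the text, and abstractive summarization, which generates new sentences that paraphrase it.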

In this article, we will talk about Extractive Text Summarization.

TextRank Algorithm

TextRank is a graph-based ranking model for text processing that can be used to identify the most important sentences in a text. Its basic concept is to give each sentence a score for its importance and then sort the sentences accordingly. The top-ranked sentence is taken to be the main idea of the text, and can also be read as its summary.

Work Principle of TextRank

The first step is to divide the text into sentences. After all the sentences in the text have been separated into a list, a graph is created in which each sentence is a node, and the nodes are linked by edges weighted by the similarity between sentences. The sentences will be ranked using these weights. This is the stage where most of the time is spent and where there is the most room for creativity: different similarity measures will significantly change the TextRank result. The similarity scores are put into a graph that helps decide which sentence is the most important. Once the graph has been built, PageRank is run on it.
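The splitting and graph-building steps described above can be sketched as follows. This is a minimal illustration, not the article's actual code: the regex splitter is naive, and the word-overlap (Jaccard) similarity is only one assumed choice of edge weight among many. The names `split_sentences`, `overlap_similarity`, and `build_graph` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def word_set(sentence):
    """Lowercased set of the words in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def overlap_similarity(a, b):
    """Assumed similarity: shared words divided by all distinct words (Jaccard)."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def build_graph(sentences):
    """Each sentence index is a node; edges carry the similarity as weight."""
    graph = {i: {} for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = overlap_similarity(sentences[i], sentences[j])
            if weight > 0:
                graph[i][j] = weight
                graph[j][i] = weight  # the graph is undirected
    return graph

text = "The cat sat on the mat. The dog sat on the rug. Birds fly south."
sentences = split_sentences(text)
graph = build_graph(sentences)
```

Here the first two sentences share three of their seven distinct words, so they are linked with weight 3/7, while the third sentence shares no words with the others and stays unconnected.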

The PageRank algorithm was developed for websites that cite (link to) each other. Whichever website is cited the most among the given addresses will be seen as the root of those websites. As an example, suppose there is a website node ‘A’ whose neighbors and edge weights are ‘B’ (0.65), ‘C’ (0.04), and ‘D’ (0.27). Then the probability of following a citation from website ‘A’ to each of those nodes is:

A to B = 0.65/(0.65+0.04+0.27) ≈ 0.677

A to C = 0.04/(0.65+0.04+0.27) ≈ 0.042

A to D = 0.27/(0.65+0.04+0.27) ≈ 0.281

These probabilities are computed for every node, and PageRank uses them to give each website a score. If ‘A’ ends up with the highest score, website ‘A’ will be considered the main website cited among all the given websites. The same idea carries over to TextRank, where instead of references between websites, the words shared between sentences are counted and weighted. By choosing the sentence with the highest rank, the summary, or what is taken to be the main idea, is extracted.
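The arithmetic in the example above can be checked with a few lines of Python. The dictionary name `weights_from_a` is only an illustrative choice; the weights are the ones given for node ‘A’.

```python
# Edge weights from node 'A' to its neighbors, as given in the example.
weights_from_a = {"B": 0.65, "C": 0.04, "D": 0.27}

# Normalize each weight by the total so the values form a probability distribution.
total = sum(weights_from_a.values())
transition = {node: weight / total for node, weight in weights_from_a.items()}

for node, probability in transition.items():
    print(f"A to {node} = {probability:.3f}")
```

Normalizing by the total weight guarantees that the probabilities leaving a node sum to 1, which is what lets PageRank treat the graph as a random walk.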

An Example of TextRank Text Summarization

As with every other NLP project, extracting the data from its original form is the first step. The data should also be cleaned to get the most accurate result possible. The texts within the data are split into sentences to make the ranking decision easier. After that, the words used in those texts are counted so that they can be used for importance ranking. The sentence containing the most frequently used words will be a front runner.
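The counting step can be illustrated with the standard library alone. This is a sketch under assumptions: the three sample sentences are made up for illustration, the tokenizer is a simple regex, and scoring a sentence by the summed frequency of its words is one plausible reading of "the most used words".

```python
from collections import Counter
import re

# Hypothetical, already-split sentences used only for illustration.
sentences = [
    "TextRank ranks sentences.",
    "PageRank ranks web pages.",
    "TextRank builds on PageRank.",
]

def tokenize(sentence):
    """Lowercase the sentence and keep only alphanumeric words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

# How often each word occurs in the whole text.
word_counts = Counter(word for s in sentences for word in tokenize(s))

# One bag of words (word -> count) per sentence.
bag_of_words = [Counter(tokenize(s)) for s in sentences]

# Score each sentence by the total text-wide frequency of its words.
sentence_scores = [sum(word_counts[w] for w in tokenize(s)) for s in sentences]
```

Note that a raw sum favors longer sentences, which is one reason the later steps compare sentences by similarity instead of by counts alone.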

[Figure: bag_of_words]

The next step is to convert these sentences, along with all the distinct words in them, into vectors. After the similarity between every pair of sentences has been computed from the words they use, these values are put into a similarity matrix that the system processes.
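A minimal sketch of this step, assuming word-count vectors and cosine similarity (the article does not state which similarity measure its example uses, so cosine is an assumption here, as are the sample sentences and function names):

```python
import math
import re

# Hypothetical sentences used only for illustration.
sentences = [
    "The cat sat on the mat.",
    "The dog sat on the rug.",
    "Birds fly south.",
]

def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())

# Every distinct word in the text gets one vector dimension.
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

def to_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = tokenize(sentence)
    return [tokens.count(word) for word in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = [to_vector(s) for s in sentences]
n = len(sentences)

# Pairwise similarities; the diagonal is left at 0 so that a sentence
# does not vote for itself in the ranking step.
similarity_matrix = [
    [0.0 if i == j else cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
    for i in range(n)
]
```

With these sentences, the first two have a cosine similarity of 0.75, and the third has a similarity of 0 with both of them because it shares no words.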

These similarities are put into a network graph, and the PageRank algorithm is run on it to score and sort the sentences. The sentence whose score is closest to 1 is the one most similar to the rest of the text, and is therefore assumed to be the main idea.
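The ranking step can be sketched as a plain power iteration over the similarity matrix. The damping factor of 0.85, the iteration limit, and the tolerance are conventional values assumed here, not taken from the article, and the small matrix is made up for illustration.

```python
def pagerank(similarity, damping=0.85, iterations=100, tol=1e-8):
    """Weighted PageRank over a similarity matrix (list of lists)."""
    n = len(similarity)
    scores = [1.0 / n] * n
    # Total outgoing weight of each node, used to normalize its edges.
    out_weight = [sum(row) for row in similarity]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if out_weight[j] > 0:
                    # Node j passes on its score in proportion to edge weight.
                    incoming += scores[j] * similarity[j][i] / out_weight[j]
                else:
                    # A node with no edges spreads its score evenly.
                    incoming += scores[j] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        change = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# Hypothetical similarity matrix for three sentences.
similarity = [
    [0.0, 0.75, 0.25],
    [0.75, 0.0, 0.25],
    [0.25, 0.25, 0.0],
]
scores = pagerank(similarity)
```

The scores sum to 1, so no single sentence normally reaches 1 itself; the sentence with the largest score is the one treated as most central. In this matrix the first two sentences tie for the top score because they are strongly similar to each other.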

[Figure: similarity of sentences]

In the end, a function sorts the sentences of the text by importance, so that the most important ones come first.
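Such a function might look like the sketch below. The name `summarize` and the `top_n` parameter are hypothetical, the scores are placeholder values, and returning the chosen sentences in their original order is a design choice made here for readability.

```python
def summarize(sentences, scores, top_n=1):
    """Return the top_n highest-scoring sentences in their original order."""
    # Sentence indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n, then restore document order.
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen]

# Placeholder sentences and scores, for illustration only.
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
scores = [0.2, 0.5, 0.3]

main_idea = summarize(sentences, scores, top_n=1)
short_summary = summarize(sentences, scores, top_n=2)
```

With `top_n=1` the result is the single sentence taken as the main idea; a larger `top_n` gives a longer extractive summary.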

[Figure: result]

You can access the article that has been used in the example here.

References

https://www.researchgate.net/publication/257947528_Text_SummarizationAn_Overview

https://www.researchgate.net/publication/319591034_The_enhancement_of_TextRank_algorithm_by_using_word2vec_and_its_application_on_topic_extraction

https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com

Bilgi University / Computer Engineering