Rare words in text summarization
Metadata
Author
Morozovskii, Danila
Date
2023-01-16
Citation
Morozovskii, Danila. Rare words in text summarization; A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the Department of Applied Computer Science. Winnipeg, Manitoba, Canada: University of Winnipeg, 2022. DOI: 10.36939/ir.202201301602.
Abstract
Automatic text summarization is a difficult task that requires a good understanding of the input text in order to produce a fluent, brief, and comprehensive summary. Applications of text summarization models range from legal document summarization to news summarization. To produce a good summary, the model must be able to identify where the important information is located. However, infrequently used (rare) words might limit the model's understanding of an input text, as the model might ignore such words or pay less attention to them. Another issue is that the model accepts only a limited number of tokens (words) of the input text; this truncated input might contain redundant information or omit important information located further in the text. To address the problem of rare words, we propose a modification to the attention mechanism of the transformer model with a pointer-generator layer, in which the attention mechanism receives frequency information for each word, which helps to boost rare words. Additionally, our proposed supervised learning model uses a hybrid approach incorporating both extractive and abstractive elements, so that more of the important information reaches the abstractive model in a news summarization task. We designed experiments involving a combination of six different hybrid models with varying input text sizes (measured in tokens) to test our proposed model. Four well-known datasets specific to news articles were used in this work: CNN/DM, XSum, Gigaword and DUC 2004 Task 1. Our results were compared using the well-known ROUGE metric. Our best model achieved an R-1 score of 38.22, an R-2 score of 15.07 and an R-L score of 35.79, outperforming three existing models by several ROUGE points.
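For readers unfamiliar with the R-1 and R-2 figures quoted in the abstract, the following is a minimal sketch of how ROUGE-N n-gram overlap is commonly computed. It is an illustration only, not the evaluation code used in the dissertation: the function names `ngrams` and `rouge_n`, the lowercasing, and the whitespace tokenization are assumptions made for this example, and the record does not state which ROUGE implementation or settings were used. ROUGE-L, which is based on the longest common subsequence, is not covered here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap between two texts.

    Assumption: simple lowercasing and whitespace tokenization; real ROUGE
    toolkits may apply stemming or other preprocessing.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)

    # Clipped overlap: each n-gram counts at most as often as it appears
    # in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    gold = "the cat lay on the mat"
    print("ROUGE-1:", rouge_n(generated, gold, n=1))
    print("ROUGE-2:", rouge_n(generated, gold, n=2))
```

Scores such as 38.22 are conventionally these fractions multiplied by 100; whether a given paper reports recall or F1 depends on its evaluation setup.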