Analysis of Domain Independent Statistical Keyword Extraction Methods for Incremental Clustering

Title: Analysis of Domain Independent Statistical Keyword Extraction Methods for Incremental Clustering

Authors: Rossi, Rafael Geraldeli; Marcacini, Ricardo Marcondes; Rezende, Solange Oliveira

Abstract: Incremental clustering is a useful approach for organizing dynamic text collections. Because of the time and space restrictions of incremental clustering, textual documents must be preprocessed to retain only their most important information. Domain independent statistical keyword extraction methods are useful in this scenario: they analyze the content of each document individually rather than the whole document collection, and they are fast and language independent. However, different methods make different assumptions about the properties of keywords in a text, and so they extract different sets of keywords. Different ways of structuring a textual document for keyword extraction can also change the set of extracted keywords. Furthermore, extracting a small number of keywords might degrade the incremental clustering quality, while a large number of keywords might increase the clustering time. In this article we analyze different ways to structure a textual document for keyword extraction, different domain independent keyword extraction methods, and the impact of the number of keywords on the incremental clustering quality. We also define a framework for domain independent statistical keyword extraction that allows the user to set different configurations in each step of the framework. This allows the user to tune the automatic keyword extraction according to their needs or to some evaluation measure. A thorough experimental evaluation with several textual collections showed that the domain independent statistical keyword extraction methods obtain results competitive with the use of all terms, or even with selecting terms by analyzing the entire text collection. This is promising evidence in favor of computationally efficient methods for preprocessing text streams or large textual collections.
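
As a rough illustration of the idea in the abstract (scoring terms using only the content of one document and keeping the top few), here is a minimal sketch. It is not the framework defined in the article: plain term frequency stands in for the statistical methods the article compares, and the function name `extract_keywords`, the parameter `k`, and the `min_len` length filter are all assumptions made for this example.

```python
import re
from collections import Counter

def extract_keywords(text, k=5, min_len=3):
    """Return up to k of the most frequent tokens in a single document.

    Hypothetical helper: term frequency is an assumed stand-in scoring
    function, and min_len is a crude, language-independent substitute
    for a stopword list.
    """
    # Lowercase and keep runs of letters only (Unicode-aware).
    tokens = [t for t in re.findall(r"[^\W\d_]+", text.lower())
              if len(t) >= min_len]
    counts = Counter(tokens)
    # Rank by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

doc = ("Incremental clustering organizes text streams. "
       "Clustering text needs compact text representations.")
print(extract_keywords(doc, k=2))  # ['text', 'clustering']
```

Varying `k` in a sketch like this corresponds to the trade-off the abstract mentions: fewer keywords give a smaller representation per document, at a possible cost in clustering quality.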

Keywords: Automatic keyword extraction; incremental clustering; text preprocessing

Pages: 21

DOI: 10.21528/lmln-vol12-no1-art2

Article PDF: vol12-no1-art2.pdf

BibTeX file: vol12-no1-art2.bib