A Dimensionality Reduction Model Applied to Documents Useful to Compliance

Title: A Dimensionality Reduction Model Applied to Documents Useful to Compliance

Authors: Joao Amaral and Prof. Dr. Fernando Buarque de Lima Neto.

Abstract:
This paper proposes a semantic Natural Language Processing (NLP) approach to assist in the automated characterization of information relevant to compliance activities. Latent Semantic Analysis (LSA) was used for dimensionality reduction. The model was evaluated on two datasets: audit reports issued by the State General Secretariat of Management of Pernambuco (SCGE, Secretaria da Controladoria-Geral do Estado in Portuguese) between 2010 and 2019, and appellate decisions issued by the Brazilian Federal Accountability Office (TCU, Tribunal de Contas da União in Portuguese) in 2019. Two dimensionality reduction methods were compared: TF-IDF and LSA. K-means clustering was used to validate the results, and the Silhouette technique helped find the best number of clusters for a given data sample. LSA combined with K-means performed best on both datasets, with its best results on the TCU appellate decisions.
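The pipeline summarized in the abstract (TF-IDF weighting, LSA reduction, K-means clustering, Silhouette-based choice of the number of clusters) can be sketched roughly as follows. This is only an illustrative sketch using scikit-learn, not the authors' implementation: the toy corpus, the `select_k` helper, the number of LSA components, and the candidate cluster counts are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus; the paper's audit reports and appellate
# decisions are not reproduced here.
docs = [
    "audit report on public procurement contract irregularities",
    "procurement contract audit found bidding irregularities",
    "audit of contract bidding and procurement process",
    "appellate decision on pension benefits for retired staff",
    "decision on retirement pension and staff benefits",
    "pension benefits appeal decided for retired employees",
]


def select_k(documents, k_values, n_components=2, seed=0):
    """Return (best_k, scores), scoring each k by mean Silhouette.

    `select_k` is a hypothetical helper; n_components=2 is an
    assumption suited to this tiny corpus, not a value from the paper.
    """
    # Step 1: TF-IDF term weighting (sparse document-term matrix).
    tfidf = TfidfVectorizer().fit_transform(documents)

    # Step 2: LSA, i.e. truncated SVD of the TF-IDF matrix, followed by
    # L2 normalization so Euclidean K-means approximates cosine distance.
    lsa = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    reduced = lsa.fit_transform(tfidf)

    # Step 3: cluster with K-means for each candidate k and record
    # the mean Silhouette coefficient of the resulting partition.
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
        scores[k] = silhouette_score(reduced, labels)

    # Step 4: keep the k with the highest Silhouette score.
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    best_k, scores = select_k(docs, range(2, 5))
    print("best k:", best_k)
    for k, s in sorted(scores.items()):
        print(f"k={k}: silhouette={s:.3f}")
```

On a real corpus, the number of LSA components would typically be much larger (often in the hundreds) and the range of candidate cluster counts wider; both would need tuning for the data at hand.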

Keywords:
Dimensionality reduction, Clustering, Topic modeling, Latent semantic analysis (LSA), Natural language processing, Text mining.

Pages: 6

DOI: 10.21528/CBIC2021-1

Paper (PDF): CBIC_2021_paper_1.pdf

BibTeX file: CBIC_2021_1.bib