Title: Comparative Analysis of Legal Embeddings Applied to Clustering Algorithms
Authors: José Alfredo Ferreira Costa, Nielsen Castelo Damasceno Dantas
Abstract: Text clustering plays an important role in organizing and understanding large volumes of textual data. By grouping semantically similar documents into coherent categories, or clusters, it is possible to extract pertinent information and uncover latent patterns embedded in the text. Text clustering thus enables a deeper understanding of the underlying structure and relationships within textual data, revealing patterns and thematic trends. This paper evaluates the impact of different text embeddings on the task of clustering Brazilian legal documents. The embeddings were obtained from BERT (Bidirectional Encoder Representations from Transformers) models: Jurisbert, Bert Law and Irisbert. Term Frequency-Inverse Document Frequency (TF-IDF) was also used as a baseline representation for comparison. Nine clustering algorithms were tested, including MB Kmeans, DBSCAN and BIRCH. Experiments were conducted on a database of 30,000 documents of judicial moves, written in Brazilian Portuguese, from the Tribunal de Justiça do Rio Grande do Norte. To evaluate the performance of the clustering algorithms, the Normalized Mutual Information (NMI) and Jaccard coefficients were used. Processing times are also reported for the different algorithms. The results suggest that the “Irisbert” embedding and TF-IDF perform best when considering NMI, and Bert Law and TF-IDF when considering the Jaccard coefficient, although “Irisbert” also produced good scores.
Keywords: Text clustering, Embeddings, BERT, Legal documents, NLP
Pages: 9
DOI: 10.21528/CBIC2023-181
Paper (PDF): CBIC_2023_paper181.pdf
BibTeX file: CBIC_2023_181.bib
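
Illustrative sketch: the evaluation described in the abstract (vectorize documents, cluster them, then score the partition against reference labels with NMI and a Jaccard coefficient) might look roughly like the following with scikit-learn. This is not the authors' code. The toy corpus, the reference labels, the choice of two clusters, and the helper `pair_jaccard` are all hypothetical, and the pair-counting definition of the Jaccard coefficient used here is an assumption, since the abstract does not say which variant was used. Only the TF-IDF baseline is shown; BERT embeddings would replace the matrix `X`.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import pair_confusion_matrix


def pair_jaccard(labels_true, labels_pred):
    """Pair-counting Jaccard coefficient between two partitions (assumed variant).

    Counts sample pairs placed together in both partitions, divided by
    pairs placed together in at least one of them.
    """
    c = pair_confusion_matrix(labels_true, labels_pred)
    together_in_both = c[1, 1]
    together_in_any = c[1, 1] + c[0, 1] + c[1, 0]
    # If no pair is ever grouped together, treat the partitions as identical.
    return float(together_in_both / together_in_any) if together_in_any else 1.0


# Hypothetical toy corpus standing in for the judicial-move documents.
docs = [
    "appeal filed by the defendant",
    "appeal denied by the court",
    "appeal granted by the court",
    "hearing scheduled for the witness",
    "hearing postponed by the judge",
    "hearing held with the witness",
]
labels_true = [0, 0, 0, 1, 1, 1]  # hypothetical reference classes

# TF-IDF baseline representation.
X = TfidfVectorizer().fit_transform(docs)

# Mini-batch k-means, one of the algorithms named in the abstract.
model = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
labels_pred = model.fit_predict(X)

nmi = normalized_mutual_info_score(labels_true, labels_pred)
jac = pair_jaccard(labels_true, labels_pred)
print(f"NMI={nmi:.3f}  Jaccard={jac:.3f}")
```

Both scores are invariant to how cluster ids are numbered, which is why they can compare an unsupervised partition against reference classes directly.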