Embedding generation for Text Classification of User Reviews in Brazilian Portuguese: From Bag-of-Words to Transformers

Title: Embedding generation for Text Classification of User Reviews in Brazilian Portuguese: From Bag-of-Words to Transformers

Authors: Frederico Dias Souza, João Baptista de Oliveira e Souza Filho

Abstract: Text classification is one of the most classical and widely studied Natural Language Processing (NLP) tasks. To classify documents accurately, a common approach is to provide a robust numerical representation, a process known as embedding. Embedding is a key NLP field that has advanced significantly in the last decade, especially after the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite these achievements, the literature on generating embeddings for Brazilian Portuguese texts still needs further investigation compared with that for English. Therefore, this work provides an experimental study of embedding techniques for binary sentiment classification of user reviews in Brazilian Portuguese. The analysis ranges from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models. We evaluate the models on five open-source datasets with pre-defined partitions to encourage reproducibility. The fine-tuned TLMs attain the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative ranks vary by dataset.

Keywords: Machine Learning, Deep Learning, Natural Language Processing, Sentiment Analysis, Text Classification

Pages: 8

Paper (PDF): CBIC_2023_paper_CTDM01.pdf