Detecting Evaluative Incoherence in E-commerce with LLMs: A Case Study on Amazon Brazil
Abstract
This study evaluates the ability of large language models (LLMs) to detect incoherence between the text of product reviews and their assigned rating (1 or 5 stars). Using ChatGPT-o3 in five independent runs, we observed high variability in labeling and low overall agreement (Fleiss’ 𝜅 = 0.177). A conservative approach selected 415 reviews unanimously labeled as incoherent, which were subsequently submitted for human evaluation. The agreement between human annotators was substantial (Cohen’s 𝜅 = 0.709), allowing the isolation of 231 cases with a clear sentiment judgment. The comparison showed that only 28.1% of the LLM classifications matched the human judgment. These results suggest that, while promising, LLMs still require rigorous validation and careful calibration for critical semantic interpretation tasks.
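The agreement statistics and the unanimity filter mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the binary label encoding (1 = incoherent, 0 = coherent), and the toy data are assumptions made only for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][c]: how many raters assigned category c to item i
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_exp = sum(p ** 2 for p in p_cats)
    # Note: undefined (division by zero) if every rating falls in one category
    return (p_bar - p_exp) / (1 - p_exp)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


def unanimous(ratings, label):
    """Indices of items that every run labeled with `label`."""
    return [i for i, item in enumerate(ratings) if all(r == label for r in item)]


# Hypothetical toy data: 4 reviews, each labeled by 5 independent LLM runs
# (1 = incoherent, 0 = coherent). These values are invented for illustration.
toy_runs = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]

if __name__ == "__main__":
    print("Fleiss' kappa across runs:", round(fleiss_kappa(toy_runs), 3))
    print("Unanimously incoherent reviews:", unanimous(toy_runs, 1))
    print("Cohen's kappa (two annotators):", cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Keeping only unanimously flagged reviews trades recall for precision: a review enters the human evaluation stage only when all runs agree, which is the conservative choice the abstract describes.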
Keywords:
Semantic Incoherence, Large Language Models, Online Reviews
References
Turki Aljrees, Muhammad Umer, Oumaima Saidani, Latifah Almuqren, Abid Ishaq, Shtwai Alsubai, Imran Ashraf, et al. 2024. Contradiction in text review and apps rating: prediction using textual features and transfer learning. PeerJ Computer Science 10 (2024), e1722.
Amal Almansour, Reem Alotaibi, and Hajar Alharbi. 2022. Text-rating review discrepancy (TRRD): an integrative review and implications for research. Future Business Journal 8, 1 (2022), 3.
Michela Fazzolari, Vittoria Cozza, Marinella Petrocchi, and Angelo Spognardi. 2017. A study on text-score disagreement in online reviews. Cognitive Computation 9, 5 (2017), 689–701.
Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 5 (1971), 378.
Michaela Geierhos, Frederik Simon Bäumer, Sabine Schulze, and Valentina Stuß. 2015. "I grade what I get but write what I think." Inconsistency Analysis in Patients’ Reviews. In ECIS.
Nan Hu, Noi Sian Koh, and Srinivas K Reddy. 2014. Ratings lead you to the product, reviews help you clinch it? The mediating role of online review sentiments on product sales. Decision Support Systems 57 (2014), 42–53.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024).
Mir Riyanul Islam. 2014. Numeric rating of Apps on Google Play Store by sentiment analysis on user reviews. In 2014 international conference on electrical engineering and information & communication technology. IEEE, 1–4.
Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs understand user preferences? Evaluating LLMs on user rating prediction. arXiv preprint arXiv:2305.06474 (2023).
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics (1977), 159–174.
Shijie Liu, Ruixin Ding, Weihai Lu, Jun Wang, Mo Yu, Xiaoming Shi, and Wei Zhang. 2025. Coherency Improved Explainable Recommendation via Large Language Model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 12201–12209.
Juan Pedro Mellinas, Juan L Nicolau, and Sangwon Park. 2019. Inconsistent behavior in online consumer reviews: The effects of hotel attribute ratings on location. Tourism Management 71 (2019), 421–427.
Susan M Mudambi, David Schuff, and Zhewei Zhang. 2014. Why aren’t the stars aligned? An analysis of online review content and star ratings. In 2014 47th Hawaii International conference on system sciences. IEEE, 3139–3147.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
Denilson Alves Pereira. 2021. A survey of sentiment analysis in the Portuguese language. Artificial Intelligence Review 54, 2 (2021), 1087–1115.
Abhinav Sharma, Sangwon Park, and Juan L Nicolau. 2020. Testing loss aversion and diminishing sensitivity in review sentiment. Tourism Management 77 (2020), 104020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Hetong Wang, Pasquale Minervini, and Edoardo M Ponti. 2024. Probing the emergence of cross-lingual alignment during LLM training. arXiv preprint arXiv:2406.13229 (2024).
Xiaoyu Zhang, Yishan Li, Jiayin Wang, Bowen Sun, Weizhi Ma, Peijie Sun, and Min Zhang. 2024. Large language models as evaluators for recommendation explanations. In Proceedings of the 18th ACM Conference on Recommender Systems. 33–42. DOI: 10.1145/3640457.3688075
Published
10/11/2025
How to Cite
MARREIRA, Emanuelle; MELO, Tiago de; OLIVEIRA, Miguel; MAURÍCIO, Carlos. Detectando Incoerências Avaliativas em E-commerce com LLMs - Um Estudo de Caso na Amazon Brasil. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 535-539. DOI: https://doi.org/10.5753/webmedia.2025.15952.
