Enhancing Virtual Human Interactions by Designing a Real-Time Dialog Filter for Mitigating Nonsensical Responses

Abstract


Virtual Humans (VHs) are crucial for facilitating discussions on sensitive topics and for training interpersonal interactions. However, conversational errors, such as nonsensical responses, undermine the effectiveness of VH simulations. This paper explores real-time dialog filters that detect such undesired exchanges. We iteratively develop a five-step prompt design and leverage OpenAI's GPT large language model to demonstrate feasibility. Our filter distinguishes meaningful from nonsensical responses generated by a rule-based system, achieving a high F1 score (0.84) and accuracy (0.78). Comparison with human-expert classifications validates its efficacy. Filtering out nonsensical responses keeps interactions coherent and relevant, significantly improving simulation effectiveness. This study underscores how large language models can refine existing VH systems and improve virtual human dialogues.
Keywords: Virtual Humans, Large Language Models, Real-Time Dialog Filter, Nonsensical Responses, Virtual Human Simulation
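The filtering step the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the MEANINGFUL/NONSENSICAL labels, and the fallback strategy are hypothetical, and `ask_llm` stands in for any chat-completion call to a model such as GPT.

```python
from typing import Callable

# Illustrative prompt only; the paper's actual five-step prompt design is not
# reproduced here.
FILTER_PROMPT = (
    "Decide whether the virtual human's candidate response answers the "
    "user's utterance coherently. Reply with exactly one word: "
    "MEANINGFUL or NONSENSICAL.\n"
    "User: {user}\nCandidate: {candidate}"
)

def classify(user: str, candidate: str,
             ask_llm: Callable[[str], str]) -> str:
    """Label one candidate response as 'meaningful' or 'nonsensical'."""
    verdict = ask_llm(FILTER_PROMPT.format(user=user, candidate=candidate))
    if verdict.strip().upper().startswith("MEANINGFUL"):
        return "meaningful"
    return "nonsensical"

def filter_response(user: str, candidate: str, fallback: str,
                    ask_llm: Callable[[str], str]) -> str:
    """Pass the rule-based system's response through, substituting a safe
    fallback when the filter flags it as nonsensical."""
    if classify(user, candidate, ask_llm) == "nonsensical":
        return fallback
    return candidate

# Usage with a stubbed model call (always flags the response):
stub = lambda prompt: "NONSENSICAL"
print(filter_response("How are you feeling today?",
                      "The train leaves at noon.",
                      "Could you rephrase that?", stub))
# prints "Could you rephrase that?"
```

Passing the model call in as a callable keeps the filter independent of any particular API client and makes the binary parsing step easy to test in isolation.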

Published
30/09/2024
DE SIQUEIRA, Alexandre Gomes et al. Enhancing Virtual Human Interactions by Designing a Real-Time Dialog Filter for Mitigating Nonsensical Responses. In: SIMPÓSIO DE REALIDADE VIRTUAL E AUMENTADA (SVR), 26., 2024, Manaus/AM. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 51-60.