Part 5: Inside Atom Ai – Evaluating the Impact of RAG at Scale on AI Efficacy
In the fifth article in this series, we describe the custom evaluation framework designed to assess the performance of Atom Ai (formerly WWT GPT), an LLM-based chatbot that generates responses using a RAG pipeline.
One of the biggest challenges in building LLM-based applications is evaluating their performance. This is due primarily to the subjectivity involved in assessing LLM-generated responses and to the inherent randomness of those responses.
When Atom Ai was initially released within WWT, evaluating its generated responses relied on manual review and feedback from a limited group of subject matter expert (SME) testers. To gauge the chat assistant's performance, each query and response pair was reviewed to identify potential areas of improvement in the Atom retrieval-augmented generation (RAG) pipeline.
After the release to a wider audience, the user base surpassed 3,500 individuals within the organization, with a significant number engaging with the GPT assistant daily. Manual assessment became impractical at that scale, prompting us to seek a more efficient and scalable solution.
While the chatbot has a mechanism for feedback collection, only a fraction of users (7.5 percent, to be precise) provided feedback on the responses generated by our GPT model. Moreover, user feedback is currently limited to a binary format (i.e., thumbs up or thumbs down), which lacks the nuance needed to understand the intricacies of model performance. Users can also provide comments to clarify their feedback, but the comments received are too few and too uneven in quality to gauge the performance of the application at scale.
To achieve a more robust approach to evaluating LLM-generated responses and user behavior trends, we designed a custom Evaluation Framework. The primary objective of this Evaluation Framework is to tackle the challenge of assessing LLM-based chat assistants, a task hindered by the absence of standardized benchmarks for evaluating their responses.