Part 5: Inside Atom Ai – Evaluating the Impact of RAG at Scale on AI Efficacy

One of the biggest challenges in building LLM-based applications is evaluating their performance. This is primarily because judging response quality is subjective and LLM-generated responses are inherently random: the same query can produce different answers from one run to the next.

When Atom Ai (formerly WWT GPT) was initially released within WWT, evaluating its generated responses relied on manual review and feedback from a limited group of subject matter expert (SME) testers. To gauge the chat assistant's performance, each query and response pair was reviewed to identify potential areas of improvement in the Atom retrieval-augmented generation (RAG) pipeline.

After the release to a wider audience, the user base surpassed 3,500 individuals within the organization, with a significant number engaging with the chat assistant daily. Manual assessment became impractical at that scale, prompting us to seek a more efficient and scalable solution.

While the chatbot has a mechanism for collecting feedback, only a fraction of users (7.5 percent) provided feedback on the responses generated by our GPT model. Moreover, user feedback is currently limited to a binary format (thumbs up or thumbs down), which lacks the nuance needed to understand the intricacies of model performance. Users can also add comments to clarify their feedback, but the volume and quality of feedback collected this way are not sufficient to gauge the performance of the application at scale.
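To make the coverage problem concrete, here is a minimal sketch of how binary feedback might be summarized. It is an illustration only, not the Atom Ai implementation: the `Interaction` record, its field names, the `feedback_summary` helper, and the sample data are all hypothetical, and the sample is constructed so that 3 of 40 users leave a rating, matching the 7.5 percent figure above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Interaction:
    """One query/response pair (hypothetical schema, assumed for illustration)."""
    user_id: str
    # True = thumbs up, False = thumbs down, None = the user left no rating.
    feedback: Optional[bool] = None


def feedback_summary(interactions: Sequence[Interaction]) -> dict:
    """Summarize how many users rated anything and how positive the ratings were."""
    users = {i.user_id for i in interactions}
    rated = [i for i in interactions if i.feedback is not None]
    rating_users = {i.user_id for i in rated}
    thumbs_up = sum(1 for i in rated if i.feedback)
    return {
        "users": len(users),
        "users_with_feedback": len(rating_users),
        # Share of distinct users who left at least one rating.
        "user_feedback_rate": len(rating_users) / len(users) if users else 0.0,
        # Share of ratings that were thumbs up; says nothing about unrated responses.
        "thumbs_up_share": thumbs_up / len(rated) if rated else 0.0,
    }


# Hypothetical sample: 40 users with one interaction each, only 3 of whom rate it.
sample = [Interaction(user_id=f"user-{n}") for n in range(40)]
sample[0].feedback = True
sample[1].feedback = True
sample[2].feedback = False

summary = feedback_summary(sample)
print(summary)
```

In this made-up sample, the thumbs-up share rests on just three ratings, and the other 37 responses are never assessed at all. That is the gap a binary feedback signal leaves when it is the only measure of quality.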

**Figure 1**: Atom Ai user trend: Steady daily user count across weeks but significantly improved performance with higher thumbs up over time.

To achieve a more robust approach to evaluating the generated LLM responses and user behavior trends, we designed a custom Evaluation Framework. The primary objective of this Evaluation Framework is to tackle the challenge of assessing LLM-based chat assistants, a task hindered by the absence of standardized benchmarks for evaluating their responses.
