Part 4: Inside Atom Ai – Orchestrating and Deploying RAG at Scale for Robust AI Performance
This is the fourth article in a series exploring the technical foundations that power Atom Ai (formerly WWT GPT). It details the importance of orchestration, performance testing, role-based access control (RBAC), and feedback collection mechanisms as applied to the tool.
Introduction
In previous articles in this series, the team detailed how a retrieval-augmented generation (RAG) application extracts business insights from enterprise data and enriches them to produce meaningful and relevant output.
In this report, we discuss the key considerations in deploying and supporting Atom Ai (formerly WWT GPT) at scale to ensure a smooth production experience for thousands of WWT employees. These considerations include using callback functions for logging; performing load testing to verify the application can support many concurrent users; securing the application with built-in role-based access control (RBAC); and collecting user feedback for continued improvement. All of these are essential to the success of Atom Ai and to a smooth rollout across the entire organization.
Logging
LLM-based applications frequently make use of callback functions. Callback functions are tied to the key processes within the application and execute when one of those processes begins, ends or throws an error. They are essential to the orchestration of LLM-based applications, as they let us perform actions alongside the main RAG process while providing visibility into, and tracking of, each step.
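To make the pattern concrete, here is a minimal sketch of a callback handler wrapped around a pipeline step. The class and hook names (`on_start`, `on_end`, `on_error`) are illustrative assumptions, not Atom Ai's actual interface:

```python
import time

class CallbackHandler:
    """Hooks fired when a key process starts, ends, or raises an error."""

    def on_start(self, step: str) -> None:
        print(f"[{time.strftime('%H:%M:%S')}] {step} started")

    def on_end(self, step: str) -> None:
        print(f"[{time.strftime('%H:%M:%S')}] {step} finished")

    def on_error(self, step: str, err: Exception) -> None:
        print(f"[{time.strftime('%H:%M:%S')}] {step} failed: {err}")

def run_step(step: str, fn, handler: CallbackHandler):
    """Wrap a pipeline step so the handler fires around it."""
    handler.on_start(step)
    try:
        result = fn()
    except Exception as err:
        handler.on_error(step, err)
        raise
    handler.on_end(step)
    return result
```

Because the hooks run alongside the step itself, logging, timing and error reporting stay out of the core RAG logic.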
Here are some of the ways we use callbacks within Atom Ai:
- Token and cost logging: For each LLM call, we use a callback at the start and end of the call to track the number of tokens in both the prompt and the LLM response. This token count allows us to estimate the cost incurred with each run (see the usage-tracking sketch after this list).
- Run time tracking: We also use callbacks to track the run time for key process components, including the agentic decision-making by the LLM, each step within the retrieval pipeline, the augmentation of video transcripts, and the final response by the LLM. This tracking reveals bottlenecks in the process and potential areas for improved performance.
- Source document tracking: Callbacks provide greater visibility into, and control over, which documents the LLM is using for its response. They help facilitate in-text citations and the other features we've built to display these referenced documents to the application user.
- Error identification: If an error is encountered at any step of the process, a dedicated callback function executes to indicate where the error occurred, enabling the team to quickly identify and resolve the root cause.
- Intermediate states and response streaming: Through callbacks, we message the user to indicate when certain processes are beginning and ending. Additionally, we send out each token of the response as soon as we receive it from the LLM, so the response is streamed token-by-token on the user interface (UI). Together, these functionalities improve the user experience: users are not stuck looking at a blank screen while the LLM completes its response (see the streaming sketch below).
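The following sketch shows how token, cost and run-time tracking might be accumulated by a callback handler, covering the first two bullets above. The hook names and pricing constants are hypothetical placeholders; actual per-token rates depend on the model and provider:

```python
import time

# Assumed pricing; real rates vary by model provider.
PROMPT_COST_PER_1K = 0.003       # USD per 1K prompt tokens (hypothetical)
COMPLETION_COST_PER_1K = 0.006   # USD per 1K completion tokens (hypothetical)

class UsageTracker:
    """Accumulates token counts and per-step timings for one run."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.timings = {}    # step name -> elapsed seconds
        self._starts = {}    # step name -> start timestamp

    def on_llm_end(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Fired after each LLM call with the usage the model reported.
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def on_step_start(self, step: str) -> None:
        self._starts[step] = time.perf_counter()

    def on_step_end(self, step: str) -> None:
        self.timings[step] = time.perf_counter() - self._starts.pop(step)

    @property
    def estimated_cost(self) -> float:
        # Cost estimate derived from accumulated token counts.
        return (self.prompt_tokens / 1000 * PROMPT_COST_PER_1K
                + self.completion_tokens / 1000 * COMPLETION_COST_PER_1K)
```

A tracker like this surfaces both a per-run cost estimate and the step timings that reveal pipeline bottlenecks.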
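And here is a small sketch of token-by-token streaming through a callback. Writing to stdout stands in for pushing tokens to the UI, and the token list simulates a model emitting its response incrementally:

```python
import sys

def on_llm_new_token(token: str) -> None:
    # Flush each token to the output as soon as the model emits it,
    # rather than waiting for the full response.
    sys.stdout.write(token)
    sys.stdout.flush()

# Simulated incremental model output.
for token in ["Retrieval", "-", "augmented ", "generation ", "at ", "scale."]:
    on_llm_new_token(token)
print()
```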
"WWT Research reports provide in-depth analysis of the latest technology and industry trends, solution comparisons and expert guidance for maturing your organization's capabilities. By logging in or creating a free account you’ll gain access to other reports as well as labs, events and other valuable content."
Thanks for reading. Want to continue?
Log in or create a free account to continue viewing Part 4: Inside Atom Ai – Orchestrating and Deploying RAG at Scale for Robust AI Performance and access other valuable content.