August 23, 2024
The NetApp and NVIDIA Infrastructure Stack
Welcome to part 1 of the video series about the RAG Lab Infrastructure built in collaboration with NetApp, NVIDIA, and World Wide Technology. This series takes you behind the scenes of the state-of-the-art lab environment inside WWT's AI Proving Ground, powered by the Advanced Technology Center.
We start with a video covering the hardware and software stack of the lab environment.
Hardware
NVIDIA DGX H100
The NVIDIA DGX H100 is a cutting-edge AI supercomputer designed to accelerate diverse AI workloads. Equipped with eight NVIDIA H100 GPUs, the system delivers exceptional performance for complex tasks like natural language processing, recommender systems, and data analytics.
We leverage the NVIDIA DGX H100 in this lab to serve NVIDIA NIM microservices, NVIDIA Riva, and the RAG application that is exposed to end users.
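To make this concrete, here is a minimal sketch of how a client might query one of those NIM microservices. NIM LLM microservices expose an OpenAI-compatible HTTP API; the endpoint URL and model name below are hypothetical placeholders for whatever is deployed in a given lab.

```python
import requests

# Hypothetical NIM endpoint; NIM LLM microservices listen on port 8000
# by default and expose an OpenAI-compatible chat completions API.
NIM_URL = "http://nim.lab.example.com:8000/v1/chat/completions"

payload = {
    "model": "meta/llama3-8b-instruct",  # example NIM model; use your deployment's model name
    "messages": [
        {"role": "user", "content": "Summarize what a RAG pipeline does."}
    ],
    "max_tokens": 128,
}

response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```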
NetApp A800
The NetApp A800 is an all-flash storage array that is designed for use in hybrid cloud environments. It is one of the world's fastest enterprise NVMe flash arrays, and it offers industry-leading cloud integration. This makes it well-suited for AI storage, as AI applications often require fast storage and the ability to scale to meet the demands of large datasets.
Infrastructure Software
While the hardware provides reliable, fast, and efficient compute and storage, the software stack is what unlocks the hardware's true potential.
NVIDIA Base Command Manager
NVIDIA Base Command Manager (BCM) is a software platform that enables users to easily access, orchestrate, and monitor their NVIDIA compute resources. We deployed BCM to provision and configure the NVIDIA DGX systems that power the lab environment, which lets us optimize the DGX infrastructure and accelerate AI projects by quickly deploying and redeploying DGX compute between projects.
Kubernetes
After deploying BCM and integrating the NVIDIA DGX H100 systems, we quickly moved on to Kubernetes, with BCM orchestrating the deployment of the cluster.
Kubernetes is the industry's leading open-source container orchestration solution. It is used in many AI solutions to help orchestrate not only the training and serving of different AI models, but also how applications that call the models are deployed and managed.
In this lab, Kubernetes enables rapid deployment of the RAG (Retrieval-Augmented Generation) application and its supporting components, such as NVIDIA NIM microservices and NVIDIA Riva.
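As a rough illustration of what this looks like in practice, the following sketch uses the official Kubernetes Python client to declare a GPU-backed inference Deployment. The namespace, names, and image tag are hypothetical placeholders; the nvidia.com/gpu resource name is the standard one exposed by NVIDIA's device plugin.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

# Hypothetical NIM inference Deployment requesting one GPU.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "nim-llm", "namespace": "rag-lab"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "nim-llm"}},
        "template": {
            "metadata": {"labels": {"app": "nim-llm"}},
            "spec": {
                "containers": [
                    {
                        "name": "nim-llm",
                        "image": "nvcr.io/nim/meta/llama3-8b-instruct:latest",  # example image tag
                        "resources": {
                            # Ask the NVIDIA device plugin for one H100.
                            "limits": {"nvidia.com/gpu": 1}
                        },
                    }
                ]
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="rag-lab", body=deployment)
```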
NetApp Astra Trident
Next, we deployed a storage orchestration solution into the Kubernetes cluster to automate provisioning for our persistent workloads, such as our vector databases.
NetApp Astra Trident is an open-source, fully supported storage orchestrator for containers and Kubernetes distributions. It accelerates workflows by allowing end users to provision and manage storage on their NetApp storage systems without intervention from a storage administrator, freeing administrators to focus on other tasks. Trident also integrates with the entire NetApp storage portfolio, including NetApp ONTAP and Element storage systems, so users can leverage the storage infrastructure they already have.
In this lab, Trident manages storage allocation on the A800 for the NIM microservice and Riva model caches, as well as for the vector databases behind the RAG application scenarios.
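As an illustration, here is a minimal sketch of how a vector database pod might claim storage that Trident then provisions on the A800. The namespace, claim name, StorageClass name (ontap-flexvol), and size are hypothetical; in a Trident deployment, the StorageClass would point at Trident's CSI provisioner, csi.trident.netapp.io.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical PersistentVolumeClaim against a Trident-backed StorageClass.
# Trident watches for claims like this and carves out a volume on the A800.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "vectordb-data", "namespace": "rag-lab"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "ontap-flexvol",  # hypothetical Trident StorageClass
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="rag-lab", body=pvc
)
```

The key point is that no storage administrator is in the loop: the claim above is all a lab user submits, and Trident handles the volume creation on the array.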
Run:ai
After provisioning our persistent storage solution, we moved on to deploying a container scheduling engine inside the cluster to optimize how the GPUs are used in the Kubernetes cluster.
Run:ai is a GPU orchestration platform that helps organizations optimize GPU consumption inside a Kubernetes cluster. In this lab, Run:ai dramatically extends the number of simultaneous deployments our NVIDIA DGX H100 can support: rather than allocating only whole GPUs, users can request anything from a full H100 down to a tenth of one of the system's eight GPUs. This took us from 3 simultaneous labs to over 30 at the time of this recording.
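As a rough sketch of what a fractional request can look like, the pod below asks Run:ai for a tenth of a GPU using the gpu-fraction annotation and the runai-scheduler scheduler name from Run:ai's documentation. Exact annotation and scheduler details vary by Run:ai version, and the pod name, namespace, and image are hypothetical placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical pod requesting a fraction of a GPU via Run:ai.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "rag-lab-workbench",
        "namespace": "rag-lab",
        # Run:ai's fractional GPU annotation: 0.1 = one tenth of a GPU.
        "annotations": {"gpu-fraction": "0.1"},
    },
    "spec": {
        # Hand scheduling to Run:ai instead of the default scheduler.
        "schedulerName": "runai-scheduler",
        "containers": [
            {"name": "workload", "image": "nvcr.io/nvidia/pytorch:24.05-py3"}  # example image
        ],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="rag-lab", body=pod)
```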
Conclusion
Thanks for watching the first video of the RAG Lab Infrastructure video series, created in collaboration with NetApp, NVIDIA, and World Wide Technology. See you in the next video.