A Deep Dive into AWS and NVIDIA NIM Integration
Overview
Artificial Intelligence (AI) is here to stay and will eventually become a core part of IT, just as networking, storage and compute are the pillars upon which everything rests. AI is the great accelerator of our time and will improve performance across all aspects of IT and many other fields.
In this accelerated world, two giant entities contribute to the foundation of this technology: AWS and NVIDIA. Both provide software tools and high-performance architecture to support the new world. While there is overlap, each has areas where it focuses and excels as a market leader: NVIDIA focuses on the core components of AI, while AWS focuses on the surrounding components that support AI.
WWT's role and achievements
WWT has an extensive history of integrating NVIDIA hardware and software for customers and won NVIDIA's Solution Provider, Networking Partner of the Year award in 2023 and its AI Enterprise Partner of the Year award in 2024. In March 2024, WWT committed $500 million to spur AI development and customer adoption. WWT is also an Elite partner in the NVIDIA Partner Network, providing customers with AI, machine learning, virtual desktop infrastructure and networking solutions.
WWT also has a long history of providing services for AWS and helps customers design, build, migrate and manage cloud solutions using AWS best practices. WWT is an Advanced Consulting Partner with AWS and entered into a Strategic Collaboration Agreement with AWS in 2024. The multi-year agreement aims to accelerate advancements in generative AI, edge-based AI and smart cloud solutions for the commercial enterprise and public sector.
By combining the power of three industry titans — WWT, NVIDIA and AWS — we can create truly revolutionary solutions for customers. NVIDIA provides the core AI, AWS provides the components that support that core, and WWT integrates it all to solve customer challenges.
The genesis of the foundation
To kickstart this revolutionary project, we are building on a robust foundation by leveraging both AWS and NVIDIA technologies. Our approach begins with utilizing NVIDIA's reference architectures, which we will enhance by integrating AWS's reference architectures. Following this, we will develop a visual interface to demonstrate the seamless integration of these technologies.
Once the integration is complete, we will deploy it as an on-demand lab within WWT's Advanced Technology Center. This setup will allow customers to request a demo and experience firsthand how the system operates. Additionally, they will have access to a user-friendly GUI application that manages all the underlying processes.
Core pillar one: NVIDIA
NVIDIA NIM™ is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.
NIM is ideal for cloud platforms and simplifies the deployment of generative AI models on GPU-accelerated servers. NIM is part of NVIDIA AI Enterprise and is the "easy button" for deploying secure, reliable and high-performance AI model inference on AWS services. At its core, it is a set of containerized microservices that work in concert to support AI workloads.
In this research article, we are using NVIDIA NIM to deploy Meta's Llama 3 8B model.
Note that you must be a member of the NVIDIA Developer Program or sign up for a 90-day NVIDIA AI Enterprise license to try this on your own.
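In practice, that membership or license grants an NGC API key, which is what lets you pull the NIM container images from NVIDIA's registry. A minimal sketch is below; the key value is a placeholder, and your account's exact onboarding steps may differ.

# Authenticate to NVIDIA's container registry (nvcr.io) with an NGC API key (placeholder value)
export NGC_API_KEY=<your-ngc-api-key>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin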
Core pillar two: AWS
AWS offers a variety of EC2 instances with NVIDIA GPUs and all the supporting components needed to host NVIDIA NIM. One core component we have leveraged is Amazon Elastic Kubernetes Service (EKS).
Amazon EKS is a managed service for running Kubernetes workloads on AWS. It eliminates the need to install and operate your own Kubernetes control plane or nodes. Amazon EKS is certified Kubernetes-conformant, allowing you to use existing tools and plugins from the Kubernetes community and AWS partners. Applications running on any standard Kubernetes environment are fully compatible and can be easily migrated to EKS.
We used EKS to orchestrate NVIDIA NIM on a single node. A few notes about our approach:
- Single node usage: EKS automatically manages the availability and scalability of the Kubernetes control plane, which is responsible for scheduling containers, managing application availability, storing cluster data and other key tasks. While we ran on a single worker node for development and testing purposes, EKS is typically used in a multi-node setup to fully leverage Kubernetes' capabilities for scaling and high availability.
- Integration with AWS services: We chose EKS because it integrates seamlessly with AWS networking, security and infrastructure services, benefiting from AWS's performance, scale, reliability and availability. A minimal cluster-creation sketch follows this list.
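The sketch below shows one way to stand up a comparable single-GPU-node cluster with eksctl. It is illustrative only: the cluster name, node group name, region and instance type are assumptions, and the AWS blog's own templates handle this step differently.

# Create an EKS cluster with a single GPU-backed managed node group (names and region are illustrative)
eksctl create cluster --name nim-demo --region us-east-1 \
  --nodegroup-name gpu-nodes --node-type g5.2xlarge --nodes 1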
Leveraging AWS research
We started the project with this article from the AWS HPC Blog. We were able to follow it to achieve many of our goals but did run into some challenges, which we explain here. We also enhanced the result by building a visualization layer (a chatbot GUI) that lets you test the LLM through a user interface.
When we first read the AWS blog post, it explained the individual steps required to deploy NVIDIA NIM on EKS in good detail and linked to a GitHub repository with all the code, providing even more. However, as we got into the actual deployments, we found we were missing some context and key pieces, and we wanted to make some changes to the result to make it a bit more user-friendly for individuals who are just getting started.
This article will walk you through what we had to do to get this working successfully, as well as what we added along the way.
First hurdle – AWS quotas and AWS EC2 instance types
To successfully deploy NVIDIA NIM, you will need an instance type that includes a supported GPU. For most accounts, this will entail some planning around AWS Regions and Service Quotas increase requests, as most accounts cannot deploy the required GPU-backed instances by default.
For our deployments, determining the instance type we wanted to use and how many instances to deploy took a bit of extra time. There are many instance types that include GPUs; the AWS blog calls for a GPU with at least 24GB of memory and specifically references the g5.48xlarge instance type. This instance type far exceeds what you need to start testing NVIDIA NIM, which we address later in the article.
AWS EC2 service quotas are based on your account's history of usage. Since this was a new AWS account, we had no history for AWS to validate our needs, which made provisioning large instances difficult.
AWS EC2 service quotas for On-Demand instances are determined by the vCPU count of the instance type. A g5.48xlarge includes 8 A10G GPUs with 24GB of GPU memory each (192GB total), 192 vCPUs and 768GB of system memory. Here is where we ran into our first challenge: we wanted to use On-Demand EC2 instances and be able to deploy multiple copies of this lab simultaneously. Doing that with the recommended instance type would require a quota of at least 192 vCPUs for On-Demand G-type instances just to launch a single copy of the lab. We requested a 384-vCPU quota increase in us-east-1, which would allow two copies to run simultaneously with the recommended instance type.
After several requests to AWS to increase our quota, we were not able to obtain the 384 vCPUs we wanted and settled on a 256-vCPU quota in the region we were planning to use. In hindsight, the region we chose did matter, as some regions have more capacity available than others for unproven accounts. Had we made our request in us-east-2, we would have had a better chance of getting the quota we needed. We recommend doing this type of work in an AWS account that has a proven history of usage if possible. Since we were creating a solution that could be deployed as on-demand labs, this was not an option for us.
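For reference, quota checks and increase requests can be made from the CLI as well as the console. A minimal sketch follows; we believe the quota code for "Running On-Demand G and VT instances" is L-DB2E81BA, but verify it in the Service Quotas console before submitting a request.

# Check the current On-Demand G and VT instance vCPU quota in a region (quota code is an assumption)
aws service-quotas get-service-quota --service-code ec2 --quota-code L-DB2E81BA --region us-east-1

# Request an increase (value is total vCPUs across G and VT instance types)
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-DB2E81BA --desired-value 256 --region us-east-1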
Reading through the AWS blog post, it mentioned that you can use smaller instance sizes because the default profile of the Llama 3 8B model is fp16, meaning its eight billion parameters are stored in 16-bit floating-point precision; at two bytes per parameter, the weights need roughly 16GB, comfortably under 20GB of GPU memory. Based on this, we looked at using a smaller instance type to get started. Our thinking was that if we needed more GPUs, it would be simpler to scale our cluster out with smaller instance types than with large ones. We settled on a single g5.2xlarge instance, with a single A10G GPU, 8 vCPUs and 32GB of system memory.
We also investigated purchasing a reserved instance for this lab, but for now, on-demand capacity meets our use cases. If our needs change, we will look at reserved instances. With enough on-demand capacity for the instance type and region we were using, we could work through the deployment.
Second hurdle – Outdated blog steps
With the required instance type available, we were able to walk through a new deployment. While we were successful in deploying EKS and the worker node that leveraged our g5.2xlarge instance type, we were not able to get the NIM pod to run after deployment. After some brief troubleshooting, we realized that the AMI referenced in the AWS blog does not actually contain the NVIDIA GPU Operator.
About the NVIDIA GPU Operator
The NVIDIA GPU Operator is required for the NIM to run. It essentially delivers the GPU drivers inside EKS, and those drivers give the containers deployed to support the NIM access to the GPU.
The NVIDIA GPU Operator utilizes the operator framework within Kubernetes to interface with the NVIDIA software needed to provision the GPU used in this lab. It contains several necessary components, such as the NVIDIA drivers, the NVIDIA Container Toolkit and the Kubernetes device plugin for the GPUs. Without the drivers, we would not be able to use CUDA, which is a key component of inference and a vital component of this lab.
Deploying the GPU Operator is straightforward. Following the NVIDIA documentation directly, we attempted to deploy it, and while the chart installed, the pods for the GPU Operator would not start. After further research, we discovered we needed to include a command-line option specific to AWS EKS and the underlying OS used by the EKS nodes, which requires the CentOS-compatible container toolkit image. This was found by searching through the issues on the NVIDIA GPU Operator repository. The command is as follows:
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set toolkit.version=v1.13.1-centos7
Success!
Once we redeployed with the required toolkit version for EKS, the GPU Operator pods all started as designed. We were able to connect to the pods, run nvidia-smi and confirm that we did in fact have access to the GPUs from the nodes. We could then redeploy the NIM, and by following the AWS blog post, we were able to successfully query the LLM and get a response. At that point, everything in the AWS blog post was working as expected; however, all interaction with the NIM was done via the CLI with JSON-formatted inputs and outputs. For a better user experience in the lab, we wanted to see if we could make the data input simpler, so the user could choose between the CLI and a GUI.
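For readers following along, those checks looked roughly like the sketch below. The pod name, service name, port and model identifier are assumptions for illustration; your deployment's names will differ.

# Confirm the GPU is visible from inside a GPU Operator driver pod (pod name is a placeholder)
kubectl exec -n gpu-operator -it <nvidia-driver-daemonset-pod> -- nvidia-smi

# Query the NIM's OpenAI-compatible chat endpoint (service name, port and model name are assumptions)
curl -s http://<nim-service>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'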
Making it better with a little help from AI
We ended up creating a small Streamlit application: a ChatGPT-like front end where a user can input requests and the NIM provides the output, with streaming responses and memory within the chat. We used Anthropic's Claude 3.5 Sonnet, accessed via Amazon Bedrock, to help us quickly design the Streamlit application, create a container and deploy it to Amazon Elastic Container Registry (ECR), and make the required additions to our deployment scripts. Those additions added an AWS Load Balancer, exposed the NVIDIA inference API to the Streamlit app, and provided a public URL, firewalled to our specific IP space, for direct access to the application.
Figure 2.1 – The User Interface WWT Created
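Packaging the front end followed the usual container workflow. A minimal sketch of pushing the image to ECR is below; the repository name, region and account ID are placeholders, not the values we used.

# Create an ECR repository for the Streamlit image (name and region are illustrative)
aws ecr create-repository --repository-name nim-chat-ui --region us-east-1

# Authenticate Docker to ECR, then build, tag and push the image (<account-id> is a placeholder)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker build -t nim-chat-ui .
docker tag nim-chat-ui:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/nim-chat-ui:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/nim-chat-ui:latest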
Improving the experience via automation
With all the necessary configuration files and CloudFormation templates prepared, we needed an automated script to create and destroy the lab environment to manage costs effectively. Although the lab is relatively inexpensive to run, costing less than $2 per hour, leaving it operational continuously would result in significant expenses over time.
By leveraging Anthropic's Claude 3.5 Sonnet, we created a shell script that handles the creation and teardown of all the necessary components, and we can run it through Terraform as needed for each lab user. The fully automated deployment takes about 25 minutes to reach an operational environment, with EKS being the most time-consuming part at approximately 20 minutes to provision. To further enhance the lab, we plan to implement an always-on EKS instance to reduce deployment time and improve the user experience, as EKS is significantly less expensive to keep running than GPU-backed instance types.
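The script itself is not published here, but its shape is roughly the hedged sketch below. The cluster name, region and manifest file name are placeholders; the real script also wires in the load balancer, Streamlit app and other pieces described above.

#!/usr/bin/env bash
# Illustrative create/teardown wrapper for the lab (names and files are placeholders)
set -euo pipefail

case "${1:-}" in
  create)
    eksctl create cluster --name nim-demo --region us-east-1 \
      --nodegroup-name gpu-nodes --node-type g5.2xlarge --nodes 1
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm install --wait --generate-name -n gpu-operator --create-namespace \
      nvidia/gpu-operator --set toolkit.version=v1.13.1-centos7
    kubectl apply -f nim-and-ui-manifests.yaml   # NIM deployment plus Streamlit front end (illustrative file name)
    ;;
  destroy)
    eksctl delete cluster --name nim-demo --region us-east-1
    ;;
  *)
    echo "usage: $0 {create|destroy}" >&2
    exit 1
    ;;
esac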
The result was a fully functional lab that can be deployed using AWS CloudFormation. This setup provides users with direct access to EKS for traditional Kubectl commands and includes a front-end GUI-based application for basic chat functionality.
What's next?
We are continuing to evolve the lab and plan to add observability and benchmarking to the core components. We also plan to build other labs based on this one that will add industry-specific use cases and technology. Stay tuned!
In the meantime, you can request the lab today and schedule a time for one of our architects to walk you through the creation, deployment and use of the interface that drives the underlying NVIDIA NIM on AWS with the Llama 3 8B fp16 model.