Top Considerations for Building High-Performance Storage Solutions for AI
The rapid growth of generative AI has prompted businesses to quickly adopt AI solutions to benefit their customers and employees. While there's no doubt that AI's ability to generate insights from large datasets can be a catalyst for better and faster decision-making, many businesses feel compelled to move forward before fully understanding their AI use cases and workload demands. This rushed approach to AI adoption can lead to costly mistakes that delay solution deployment and reduce return on investment.
As discussed in our article, 8 Trends Shaping the Data Storage Landscape, customizing storage solutions for specific applications continues to be a trend. However, in many cases, traditional storage systems are ill-suited to the requirements of AI workloads. Organizations must first understand their AI-related business goals and desired outcomes before identifying the right infrastructure to get them there.
Training data is a foundational component of generative AI. You have to consider where that data lives, how it's accessed, how sensitive it is, and which business processes depend on the AI model's performance. In most cases, designing an AI solution for a high-performance architecture (HPA) will require high-performance storage (HPS).
To make this process easier for our clients, WWT is building a cutting-edge AI Proving Ground inside our Advanced Technology Center (ATC). In addition to proofs of concept and enablement efforts, this new composable lab environment will help clients gain a comprehensive understanding of the leading storage offerings for AI use cases and HPA environments. While many storage solutions may appear similar, each vendor has key differentiators and capabilities that require careful consideration when building a new AI infrastructure.
In close collaboration with our OEM partners and early adopters within the AI landscape, WWT is committed to capturing relevant trends and insights concerning storage solutions. To that end, this article outlines our top considerations for selecting HPS solutions.
Performance
It's no surprise that performance is perhaps the most frequently discussed aspect of storage for AI environments. AI workloads are typically intensive in both compute and data access. Moreover, the processing and networking layers of AI solutions require a significant financial investment, meaning clients will want to make sure they're keeping those assets busy.
Storage needs to be able to "feed the beast," which means carefully considering latency and throughput needs when selecting a solution to support AI. The performance of the underlying storage system is especially important during AI model training, which can be a lengthy process. Faster modeling leads to more iterations, which can improve accuracy.
AI workloads are also often latency-sensitive. Your storage solution should be able to quickly respond to requests, even when under a heavy workload. For real-time AI applications, low-latency storage is critical to the efficiency of the computing environment.
Throughput is equally (if not more) important to consider when sizing a storage solution for AI. AI workloads, especially in the generative AI space, require high throughput due to the large amount of data being ingested and read. It is important to understand the requirements of the workload and size accordingly. Because GPUs are an expensive asset to leave in an idle state, throughput at the storage and network layers is key to keeping these resources busy when training models.
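As a rough illustration, the aggregate read throughput needed to keep a GPU fleet busy can be estimated from the number of GPUs, their data consumption rate and the average sample size. The sketch below is a back-of-envelope model with hypothetical numbers, not a sizing tool for any particular vendor's system:

```python
# Back-of-envelope throughput sizing for a training cluster.
# All figures used here are illustrative assumptions, not vendor specs.

def required_storage_throughput_gbps(num_gpus: int,
                                     samples_per_sec_per_gpu: float,
                                     avg_sample_size_mb: float,
                                     headroom: float = 1.5) -> float:
    """Estimate aggregate read throughput (GB/s) needed to keep GPUs fed.

    headroom covers bursts such as shuffle epochs and checkpoint restarts.
    """
    raw_gb_per_sec = num_gpus * samples_per_sec_per_gpu * avg_sample_size_mb / 1000
    return raw_gb_per_sec * headroom

# Hypothetical example: 64 GPUs, each consuming 500 samples/s at ~0.5 MB each.
print(round(required_storage_throughput_gbps(64, 500, 0.5), 1))  # → 24.0 GB/s
```

Even this crude model makes the point: undersizing storage throughput leaves an expensive GPU investment idle, so the workload's ingest rate should drive the storage design, not the other way around.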
Scalability
AI environments generate and consume massive amounts of data. As AI capabilities evolve, models are becoming more complex and data-intensive. It is critical that HPS solutions are flexible enough to accommodate this level of unanticipated growth. A storage solution that can easily scale both performance and capacity will ensure your environment can meet the current and future demands of AI.
Traditional storage systems typically rely on a single controller or pair of controllers to manage capacity, which can lead to performance bottlenecks as data volumes grow. In contrast, HPS systems prioritize balanced scaling, adding both compute and storage capacity to maintain consistent performance and throughput. That's why it is crucial to understand each storage vendor's specific scaling approach and its potential impact on your environment.
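The difference between the two scaling models can be shown with a toy calculation. The throughput figures below are hypothetical, purely to illustrate why adding capacity behind a fixed controller pair caps performance while scale-out grows it:

```python
# Illustrative comparison: scale-up (fixed controllers) vs scale-out throughput.
# Per-controller and per-node figures are assumptions for intuition only.

CONTROLLER_GBPS = 20   # assumed throughput ceiling per controller
NODE_GBPS = 10         # assumed throughput per node in a scale-out system

def scale_up_throughput(shelves: int, controllers: int = 2) -> int:
    # Adding capacity shelves does not add controllers,
    # so system throughput stays capped at the controller pair.
    return controllers * CONTROLLER_GBPS

def scale_out_throughput(nodes: int) -> int:
    # Each node adds both capacity and compute,
    # so throughput grows linearly with node count.
    return nodes * NODE_GBPS

print(scale_up_throughput(shelves=8))    # → 40 (flat, regardless of shelves)
print(scale_out_throughput(nodes=8))     # → 80 (grows with each node added)
```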
Furthermore, significant differences exist in the architectural design of storage layers across various HPS vendors. Flash, SSDs, and even Storage-Class Memory (SCM) constitute the primary storage tier due to their inherent performance advantages. To optimize costs, some vendors employ tiered storage solutions, incorporating less expensive media for inactive data. That's why it is vital to carefully evaluate the workload characteristics, particularly during the modeling and inferencing phases, when weighing the trade-offs associated with each approach.
Reliability and data resiliency
High availability and data protection are not just concerns for primary storage environments; any production system needs to account for availability and resiliency, and AI environments are no exception. HPS solutions incorporate a variety of data protection capabilities, from erasure coding and node redundancy to snapshots, replication and tiering.
When working in storage environments that can easily scale to petabytes in size, it's important to understand the impact of failures on the overall system. Traditional RAID technology works well for smaller environments, but rebuild times and node redundancy are important to consider given the scale of AI solutions. Other factors, such as non-disruptive code updates and hardware upgrades, play a key role in system availability.
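To see why the choice of protection scheme matters at petabyte scale, consider the usable-capacity and fault-tolerance trade-offs of a few illustrative layouts. The parameters below are examples, not any vendor's actual implementation:

```python
# Compare capacity overhead and fault tolerance of data protection schemes.
# Scheme parameters are illustrative; real vendor layouts vary.

def protection_profile(data_units: int, parity_units: int):
    """Return (usable fraction, failures tolerated) for a k+m erasure code.

    RAID 6 can be modeled as the k+2 special case of this scheme.
    """
    total = data_units + parity_units
    return data_units / total, parity_units

for name, k, m in [("RAID 6 (8+2)  ", 8, 2),
                   ("Erasure 16+4  ", 16, 4),
                   ("3-way replica ", 1, 2)]:
    usable, faults = protection_profile(k, m)
    print(f"{name}: {usable:.0%} usable, survives {faults} failures")
```

Note that a wider erasure stripe (16+4) tolerates twice as many concurrent failures as 8+2 at the same 80 percent usable capacity, which is one reason scale-out HPS systems favor erasure coding over traditional RAID as environments grow.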
Security
Organizations looking to deploy solutions like generative AI are usually concerned with data security, since models will often have access to sensitive company data. A comprehensive data security strategy extends well beyond the storage layer, so it is important to ensure the capabilities the environment needs are taken into account when evaluating a solution.
For HPS systems, data encryption at rest and in flight is a good start, but it's becoming increasingly important to be able to lock the data itself against malicious access or modification, for example through immutable snapshots or object locking. Other features, such as data versioning and retention policies, are rapidly becoming integral to the data security discussion. Moreover, global organizations will have to consider the data sovereignty laws of other nations when building AI solutions that generate data containing personally identifiable information (PII) or other sensitive data.
Data lifecycle management
Organizations that anticipate multiple AI use cases across business units will need the ability to manage and move their data with ease, regardless of where AI model training is being performed. For example, while some business units may benefit from an off-the-shelf AI service in the public cloud, the organization will still want the ability to move, cache or replicate that data in and out of a hybrid cloud architecture. For those focusing on a single scale-out environment for LLMs, data management and movement may be less of a concern.
The easiest way to manage data placement on an HPS solution is to use high-speed flash from end to end. However, as these environments grow, this approach becomes impractical and cost-prohibitive. The frequency of access in an AI environment depends on the data set: active training data sets require frequent access for immediate processing, while other data types, like archived models or historical data, are generally accessed less often. Modern HPS solutions can migrate infrequently accessed data to lower storage tiers, or even to S3-compatible object storage and tape. Careful consideration should be given to the design of a tiered solution to ensure it will meet the performance needs of the environment.
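A tiering policy of this kind can be sketched as a simple rule over last-access age. The tier names and thresholds below are hypothetical; real HPS arrays implement such policies internally and with far more nuance:

```python
# Minimal sketch of an age-based tiering policy, assuming a catalog of
# objects with last-access timestamps. Tier names and day thresholds
# are hypothetical examples, not any specific array's defaults.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class StoredObject:
    name: str
    last_access: datetime

def choose_tier(obj: StoredObject, now: datetime) -> str:
    age = now - obj.last_access
    if age < timedelta(days=7):
        return "flash"      # active training data stays on the fast tier
    if age < timedelta(days=90):
        return "object"     # warm data demotes to S3-compatible object storage
    return "archive"        # cold models/history move to tape or deep archive

now = datetime(2024, 6, 1)
print(choose_tier(StoredObject("train-shard-001", now - timedelta(days=2)), now))
print(choose_tier(StoredObject("model-v1-ckpt", now - timedelta(days=200)), now))
```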
For clients with less predictable storage requirements, vendors are incorporating cloud-bursting capabilities into HPS arrays. Cloud bursting allows HPS arrays to scale capacity dynamically and seamlessly into the cloud layer. This technology can help reduce under-utilized infrastructure, particularly during the training and inferencing process. Like any strategy involving the cloud, clients must account for cost, performance and security implications before a capability like cloud bursting is deployed in a demanding AI environment.
Data center impact
Due to the incredibly demanding performance requirements of large-scale AI deployments, data centers are quickly feeling the need for additional power and cooling. The physical rack footprint of AI solutions is frequently problematic as well, as many data centers were simply not built to accommodate such environments. For example, if rack space is at a premium, HPS solutions focusing on density will have an advantage.
Each OEM takes a different approach to density and data reduction, and it is important to understand how these differences affect your organization's ESG and sustainability efforts. Most modern solutions offer options for SCM, tri-level cell (TLC) and quad-level cell (QLC) flash, along with the ability to tier off to even lower-cost media. Consider these factors, including their impact on total cost of ownership, when selecting an HPS solution.
Future-proofing for what's next
The AI landscape is rapidly evolving, as are the requirements of the high-performance infrastructures and architectures that enable AI solutions. Selecting a solution that can meet the immediate and future storage needs of AI is no simple task.
Flexibility, then, is key to ensuring the storage component of your AI infrastructure remains viable. Achieving this flexibility will incorporate many of the topics discussed above, such as performance, scalability and data mobility. Additionally, maintaining flexibility in on-premises, hybrid or cloud-based deployment options can help organizations more quickly adapt to the changing needs of their environment.
How WWT can help
Over the course of our history, we've leveraged the ATC's peerless capabilities to help countless clients accelerate their time-to-decision while driving innovation and critical infrastructure modernization. One of our best-kept secrets is that, during the last decade, we've also been helping clients research, build and implement advanced AI/ML solutions.
The AI Proving Ground inside the ATC is simply the next iteration in our company's commitment to helping organizations mitigate the risks of solving complicated business challenges with even more complex technology solutions. The AI Proving Ground's state-of-the-art lab environment has been designed to accelerate clients' ability to make smart decisions when designing their high-performance architectures for AI. Clients will be able to use our new AI labs to compare, test and train AI solutions with access to the latest reference architectures, hardware and software from the leading AI innovators — including many HPS systems — all in a secure, scalable and transparent manner.
For more on any of the topics discussed in this article, connect with one of our storage industry experts today.