How Machine Learning and Generative AI will Affect Data Protection in 2024 and Beyond
Introduction: Generative AI in 2024
With the advances in text and image generation demonstrated by OpenAI over the last year, artificial intelligence (AI) became the hot topic not just in information technology but in popular culture, and it is going to touch nearly everything we do in IT in some way. This post will present the current state of AI integration in the data protection software space and then spotlight several major players to discuss their AI integration strategies.
It is worth noting that AI, as such, is not a new development that fell out of OpenAI's labs in 2023. The technology to train neural network models to perform specific tasks has been around for decades. Most commonly known as machine learning (ML), this kind of AI is already in use in industries as diverse as finance, healthcare, retail and entertainment. AI is used to perform complex math, generate recommendations, and automate image recognition and diagnostics. OpenAI has successfully turned this technology to the problem of generating coherent text in response to a prompt.
The incorporation of ML technologies into data protection products is not new – it has been used for malware and anomaly detection for some time. Now generative AI (GenAI) – using large language models to create and understand text data – is entering this space as well.
How will this affect the data protection software landscape?
There are three primary use cases for ML and GenAI in the data protection landscape today:
Machine Learning for threat hunting and ransomware remediation
Data protection (traditional backup and recovery) is in a unique position to deliver huge value back to the business. Backup software makes copies of an organization's most critical information every day. Today those copies usually just sit on expensive storage until they expire, waiting for a restore request that will likely never come. Backup software also maintains a database of metadata about every file backed up: size, last access date, and even information about overall file entropy (is it encrypted?) are gathered and retained.
Leveraging this information to help an organization understand and manage its data is challenging, but two factors are coming together to help overcome the difficulties: ransomware and machine learning. The threat of ransomware is driving an intense need to understand an organization's data and scan it for signs of compromise. At the same time, ML is putting the capability of performing deep threat analysis against petabytes of data within reach. Data protection vendors are working to incorporate these capabilities into their products, and it is reasonable to expect that threat hunting and metadata analysis will be "table stakes" features within the next few years.
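As a rough illustration of the kind of metadata-driven detection described above (a minimal sketch, not any vendor's implementation), the snippet below flags files whose content entropy jumped sharply between backup runs, one common ransomware signal. The catalog fields and thresholds are assumptions made for illustration; real products track far richer metadata and use trained models rather than fixed cutoffs.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0-8.0); encrypted or
    compressed content tends toward the top of that range."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def flag_suspect_files(catalog, entropy_jump=2.0, high_entropy=7.5):
    """Flag files whose entropy rose sharply between backup runs and is
    now near-random. `catalog` is a list of dicts with hypothetical
    fields: {"path": str, "prev_entropy": float, "curr_entropy": float},
    where the entropy values would come from shannon_entropy() above."""
    suspects = []
    for entry in catalog:
        delta = entry["curr_entropy"] - entry["prev_entropy"]
        if entry["curr_entropy"] >= high_entropy and delta >= entropy_jump:
            suspects.append(entry["path"])
    return suspects

# Example usage with made-up catalog entries
catalog = [
    {"path": "/finance/q3.xlsx", "prev_entropy": 4.1, "curr_entropy": 7.9},
    {"path": "/hr/handbook.docx", "prev_entropy": 4.3, "curr_entropy": 4.4},
]
print(flag_suspect_files(catalog))  # ['/finance/q3.xlsx']
```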
GenAI for deep text analysis of data on protection storage
The capability of current large language models to assimilate and "understand" large bodies of text data can be deployed to perform deep analysis of the content of data that has been backed up. This kind of analysis is useful for things like data classification and locating specific data sets for compliance and legal hold tasks.
Many organizations currently struggle with understanding their data as it relates to regulatory requirements such as GDPR, HIPAA and PCI. GenAI-enabled analysis of backup data can help organizations uncover misclassified or unclassified data and enact proper controls.
Once the data has been put into a format the GenAI can parse and understand in context, it becomes possible to run detailed natural language searches against it. This enables additional data classification tasks on a large scale, such as "produce a list of all documents and emails that reference project jabberwocky" or "produce a list of all emails that mention one of the following anticompetitive practices…"
The effort required to "vectorize" this data for the GenAI model remains quite high, so the adoption of this technology into the data protection space will likely be phased, with cloud-based data sets leading and on-premises vectorization and analysis of mission-critical data coming soon after. Performing this kind of analysis against an entire environment will probably have to wait for further advances in compute capabilities, but things are evolving quickly, and general adoption of these capabilities is probably much closer than we think.
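To make the "vectorize, then search in natural language" idea concrete, here is a minimal sketch using an off-the-shelf embedding model. The model name, the sample documents, and the in-memory index are assumptions for illustration; a production system would use a document-extraction pipeline and a dedicated vector database rather than a Python dictionary.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in for text extracted from backed-up files and mailboxes
documents = {
    "email_0412.txt": "Status update on project jabberwocky milestones ...",
    "contract_77.txt": "Master services agreement between the parties ...",
}

doc_ids = list(documents)
doc_vectors = model.encode([documents[d] for d in doc_ids],
                           normalize_embeddings=True)

def search(query: str, top_k: int = 5):
    """Return the backed-up documents most similar to a natural language query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity (vectors are normalized)
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(doc_ids[i], float(scores[i])) for i in ranked]

print(search("documents and emails that reference project jabberwocky"))
```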
GenAI for user assistance and interfacing
One of the earliest emerging use cases for large language model generative AI is around helping users automate tasks, essentially in real time using natural language. Microsoft refers to their implementation of this as "Copilot." The AI helps the user perform tasks such as code creation, document summary and revision or even the generation of complete email text.
In the data protection space, this kind of functionality will enable users to respond to alerts or perform activities based on a natural language communication model. The backup administrator will be able to ask an AI assistant to "create a data protection policy for all the clients in the environment with 'SQL' in the hostname," and the assistant will walk the admin through any further information needed to create the policy: "How often do you want that to run?" and so forth.
An AI assistant will add value in helping administrators perform less common tasks or tasks that need on-the-fly automation: "Perform a recovery of all the clients in the 'critical applications' policy. Use the last known good backup and restore over the existing host." Walking admins through common troubleshooting and resolution scenarios is another use case for this type of assistant.
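Under the hood, this style of assistant typically maps a natural language request onto structured actions the platform already exposes, often called "tool" or "function" calling. The sketch below is a hedged illustration of that pattern; the `ask_llm_for_tool_call` helper, the policy parameters, and the tool catalog are hypothetical stand-ins, not any vendor's API.

```python
import json

# Hypothetical catalog of actions the assistant is allowed to take; a real
# product would wire in its existing policy and recovery APIs here.
TOOLS = [{
    "name": "create_protection_policy",
    "description": "Create a backup policy for clients matching a hostname pattern.",
    "parameters": {
        "hostname_pattern": "string, e.g. '*SQL*'",
        "schedule": "string, e.g. 'daily 02:00'",
        "retention_days": "integer",
    },
}]

def ask_llm_for_tool_call(user_request: str) -> dict:
    """Placeholder for an LLM call with tool/function calling enabled.
    Hard-coded here so the sketch runs without a model behind it."""
    return {
        "name": "create_protection_policy",
        "arguments": {"hostname_pattern": "*SQL*",
                      "schedule": None, "retention_days": None},
    }

def handle_request(user_request: str) -> str:
    call = ask_llm_for_tool_call(user_request)
    args = call["arguments"]
    # The assistant follows up on missing parameters in natural language,
    # e.g. "How often do you want that to run?"
    missing = [k for k, v in args.items() if v is None]
    if missing:
        return f"Need more information: {', '.join(missing)}"
    return f"Calling {call['name']} with {json.dumps(args)}"

print(handle_request(
    "Create a data protection policy for all clients with 'SQL' in the hostname"))
```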
OEM examples of these use cases
All major backup vendors are working on some level of integration for all three of these use cases. Several vendors currently stand out in their early adoption and marketing of ML for threat detection, GenAI for user assistance, or GenAI for data insight.
While we are looking deeply at Dell, Rubrik and Cohesity, it is important to keep in mind:
- Almost everyone in the industry is looking to integrate Natural Language Processing (NLP) into their product and support stack to reduce and streamline support and troubleshooting.
- Machine Learning (ML) is the technology that drives most threat hunting and anomaly detection platforms and almost every data protection vendor is incorporating threat hunting into their products.
- Analysis and interpretation of data residing on backup platforms has been a sought-after capability in this space for over a decade, and AI is almost certainly going to be the technology that enables it at scale.
Please treat these as examples. An in-depth look at all of the available options would result in a very long read.
Dell
Through their OEM partnership with Index Engines, Dell Technologies' CyberVault has provided ML-based anomaly detection and analysis against vaulted backup data for years. The product, called CyberSense, performs a deep file-level scan of vaulted data resident on the PowerProtect DD Series appliance (Data Domain) in the vault. Part of this process uses a trained neural network that performs content-based analytics to identify signs of corruption or compromise in the backup data. Index Engines claims 99.5 percent accuracy in detecting cyber-attack signatures.
This also enables rapid and accurate "blast radius" detection for an attack. "Which hosts are compromised, and which backups are clean?" is an important question during a cyber-attack.
Of course, this functionality comes at a price, driving a large compute workload in the vault and consuming a lot of IOPS on the backup appliance. Customers should expect to deploy a performant environment to run CyberSense and an upper-tier PPDD series appliance if they want to scan their entire vault daily.
Rubrik
Rubrik announced its "Ruby" AI assistance technology in November of 2023. Billed as the "Rubrik Security Cloud AI Companion," Ruby uses a specially trained AI model to walk users through responding to anomalies and alerts from the Data Threat Engine. Typical incident response workflows would include identifying the hosts affected by the alert, identifying the backups (Rubrik calls them snapshots) that are clean and can be recovered from, quarantining any dirty backups to avoid recovering compromised data and then instantiating the recovery of the affected hosts from the last known good backup. All in natural language without having to open a recovery or security interface.
At first glance, this type of "front-ending the interface with Natural Language Processing" solution might seem to be an easy and obvious use case. The success of this type of AI enhancement is going to depend on the quality of the specific training that enables the AI to give good advice and guide the user toward the correct response.
Cohesity
Cohesity has announced its "Turing" suite of AI technologies, with one of the first examples set to hit the market in Q1 of 2024. This technology will let customers search and ask questions of the information in their backup data sets, using generative AI and natural language processing (NLP) services from a cloud provider of their choice to gain deeper insights from their secondary data.
Through the mechanisms of data protection, Cohesity customers have created a de facto time series data lake with their secondary data. Pairing this with NLP, artificial intelligence/machine learning (AI/ML), and generative AI will enable customers to have conversations with that data and use historical context to develop a deeper understanding of what's going on. This will allow customers to re-leverage their own data to drive operational efficiencies by reducing time to action, accomplishing tasks faster, and getting more information from their systems to drive innovation.
Customers will leverage Cohesity's Turing platform to render backup data into a format that enables generative AI and natural language processing to contextualize it ('vectorization' of the data). This vectorized data then becomes an external data source that a large language model (such as ChatGPT) can access using a technique called retrieval augmented generation (RAG). The AI can then answer natural language questions about the backup data, including links back to the source data.
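In simplified form, the RAG flow described above looks roughly like the sketch below. The `retrieve` and `llm_complete` functions stand in for whatever vector index and model endpoint a given platform uses, and the stubbed results exist only so the example runs; none of this reflects Cohesity's actual implementation.

```python
def retrieve(question: str, top_k: int = 3) -> list[dict]:
    """Return the most relevant vectorized chunks of backup data, each with
    a pointer back to its source object (stubbed result for illustration)."""
    return [
        {"text": "Q3 pricing discussion for project jabberwocky ...",
         "source": "backup://fileserver01/legal/email_0412.msg"},
    ]

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    return "Summary of the retrieved material, with citations."

def answer(question: str) -> str:
    # 1. Retrieve relevant chunks from the vectorized backup data
    chunks = retrieve(question)
    # 2. Augment the prompt with that context plus source references
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in chunks)
    prompt = ("Answer the question using only the context below, and cite "
              f"the source of each claim.\n\nContext:\n{context}\n\n"
              f"Question: {question}")
    # 3. Generate the answer grounded in the retrieved backup data
    return llm_complete(prompt)

print(answer("Which documents reference project jabberwocky?"))
```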
This utilization of backup data to drive insights back to the business is going to be compute intensive and, while it remains to be seen whether it will be practical outside of a hyperscaler environment, could be a game-changing implementation of generative AI technology.
Conclusion
The current wave of generative AI models is going to cause massive disruption across the information technology landscape. Data protection is no exception to this trend: the abilities of data protection solutions to integrate AI as a defense against cyber attacks, to facilitate recovery, and to provide insight and analytics back to the business are being developed and deployed today.
Organizations looking for a partner that understands the landscape and has been working in the AI space for over a decade should consider WWT.
WWT has committed to investing over $500 million in the next three years to drive the adoption of AI solutions at a global enterprise scale.
AI Proving Ground Lab Environment:
- WWT's Advanced Technology Center (ATC), built over the past 15 years, will be expanded.
- The ATC now includes a lab environment named The AI Proving Ground where organizations can try, test, and validate AI applications specific to their business needs.
- This initiative aims to help IT leaders assess and implement AI solutions more effectively.
Composable AI labs:
- WWT will launch a series of composable AI labs throughout early 2024.
- These labs allow companies to experiment, test, and develop custom AI solutions before committing resources.
- Housed within the ATC, these labs leverage cutting-edge equipment from leading hardware and software providers.
Generative AI focus:
- Demand for generative AI is growing, with more than half of organizations experimenting with this technology.
- WWT's investments in the ATC and data expertise are crucial for helping clients achieve better business outcomes.
- The goal is to harness AI's transformative potential while addressing the complexities of the evolving landscape.
- WWT's commitment to AI infrastructure, technology and expertise empowers organizations to embrace AI effectively and drive meaningful business impact.