When security information and event management (SIEM) platforms first stepped onto the scene, security teams were enthralled with having a single place to send all their data, and a mantra was born: "More data means more insight." Indiscriminately, SIEMs became a dumping ground for every log an organization generated. The costs that followed, however, made the approach unsustainable, and organizations were forced to decide what was actually worth evaluating.

Picking and choosing what to ingest created a new challenge: visibility gaps. It seemed like a never-ending web of problems. Too many logs meant high price tags and "garbage in, garbage out" detections, while restricting logs led to blind spots and lower-fidelity alerts. Something had to give! Somewhere in this web of problems lay an answer.

A light in the darkness 

For the longest time, tuning log ingestion was a binary exercise in inclusion and exclusion. Logs were either ingested or they were not, and organizations were forced to fight in the cyber theater with one hand tied behind their back. But a glimmer of hope emerged. Thanks to ingenuity and time, data pipeline management became an essential practice. For some infrastructures it has long been a vital, built-in component; for others it is still a diamond in the rough waiting to be discovered.

So, what makes this concept so different and special? While the "how" could fill a book in and of itself, the "why" is much simpler. Data pipeline management gives companies the best of both worlds: they can ingest what they need without breaking the bank. And the benefits do not stop at price reduction. Data pipeline management opens organizations up to greater optimization and efficiency across people, processes and technology. Let's dig deeper!

Data pipeline management: What it does best 

Cost reduction 

Whether your SIEM bill is based on ingestion or workload, the amount of data you store is always a primary cost driver. Bear with me for a moment while I spell this out. Ingestion-based pricing directly reflects the volume of logs sent from your infrastructure. Workload-based pricing is less direct: the more stored data your searches and correlations must scan, the more compute they consume, which drives up costs. However you approach it, storage and compute will always have a special place on your bill, they cannot be decoupled, and both are directly influenced by ingestion.
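
As a rough, back-of-the-envelope illustration of why this matters, consider how a modest reduction in ingested volume compounds over a year. The volumes, rate and reduction ratio below are hypothetical, not any vendor's actual pricing.

```python
# Hypothetical cost model: ingestion-based pricing scales directly with daily
# volume, so trimming volume before the SIEM shrinks the bill.
# All numbers are invented for illustration only.
DAILY_INGEST_GB = 500     # assumed raw daily log volume
PRICE_PER_GB = 2.00       # made-up ingestion rate in USD per GB
REDUCTION_RATIO = 0.40    # assume the pipeline filters out 40% of volume

before = DAILY_INGEST_GB * PRICE_PER_GB * 365
after = DAILY_INGEST_GB * (1 - REDUCTION_RATIO) * PRICE_PER_GB * 365
print(f"Annual ingestion cost before filtering: ${before:,.0f}")
print(f"Annual ingestion cost after filtering:  ${after:,.0f}")
```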

But never fear, data pipeline management is here! The year is no longer 2013, and organizations can now make decisions far more granular than inclusion or exclusion. Do you want firewall logs, just not all of them? Use simple Boolean logic to determine data routing. Do you want Active Directory logs but only care about key fields? Simply extract the fields you need, send them to your SIEM, and either drop or archive the rest.

While this can go much deeper, the principle is simple: extract what you cannot live without and put the remainder somewhere cheaper. This approach takes serious thought and architecture, but when implemented across all your sources, it reduces your bill and shrinks your visibility gaps.
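
To make the routing idea concrete, here is a minimal sketch in plain Python rather than any vendor's pipeline syntax. The source types, field names and destination labels are assumptions for illustration.

```python
# A minimal sketch of Boolean routing: decide, per event, whether it earns
# premium SIEM storage or goes to cheaper archive storage.

def route(event: dict) -> str:
    """Return a destination for one event: the SIEM or archive storage."""
    if event.get("sourcetype") == "firewall":
        # Denies and threat verdicts earn SIEM storage; routine allows do not.
        if event.get("action") in ("deny", "drop") or event.get("threat_detected"):
            return "siem"
        return "archive"
    if event.get("sourcetype") == "active_directory":
        return "siem"  # key fields go to the SIEM; the rest can be dropped or archived
    return "archive"

print(route({"sourcetype": "firewall", "action": "allow"}))  # -> archive
print(route({"sourcetype": "firewall", "action": "deny"}))   # -> siem
```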

Enrichment and ingest 

While running scripts and API calls is a bit too advanced for this stage of pre-processing, tagging and enriching data is an ideal use case. Instead of performing these actions at query time, we can increase efficiency and reduce search times by completing a substantial portion of them at ingest. What does this look like?

Think of lookup tables and Boolean logic as the basis for pre-processing rules. For example, you can compare log values against lookup tables to append contextual information, tag IP addresses according to CIDR blocks, translate bitmasks in Windows logs into human-readable values and, lastly, drop or shrink logs when certain values are present, which we will cover later.
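
Here is an illustrative sketch of what such an enrichment pass might look like at ingest. The lookup table, networks, field names and the small subset of UserAccountControl flags are assumptions for the example, not a complete or authoritative mapping.

```python
import ipaddress

# Illustrative ingest-time enrichment: lookup-table context, CIDR-based tagging
# and decoding a Windows UserAccountControl bitmask into readable flags.

ASSET_OWNERS = {"10.1.4.22": "finance-db01", "10.1.7.9": "hr-fileshare"}
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8"), ipaddress.ip_network("192.168.0.0/16")]
UAC_FLAGS = {0x0002: "ACCOUNTDISABLE", 0x10000: "DONT_EXPIRE_PASSWORD", 0x80000: "TRUSTED_FOR_DELEGATION"}

def enrich(event: dict) -> dict:
    ip = event.get("src_ip")
    if ip:
        event["asset_name"] = ASSET_OWNERS.get(ip, "unknown")  # lookup table
        event["network_zone"] = (                               # CIDR tagging
            "internal" if any(ipaddress.ip_address(ip) in net for net in INTERNAL_NETS)
            else "external"
        )
    if "UserAccountControl" in event:                           # bitmask translation
        uac = int(event["UserAccountControl"])
        event["uac_flags"] = [name for bit, name in UAC_FLAGS.items() if uac & bit]
    return event

print(enrich({"src_ip": "10.1.4.22", "UserAccountControl": 0x10002}))
```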

This is by no means an exhaustive list, but hopefully it begins to paint a clearer picture of what enrichment can look like at ingest. I cannot help but think back to my days as a SOC analyst and imagine how much query time this would have saved me, not to mention how much query logic would have been simplified. What was once an arduous process can now be completed before logs even hit your storage. 

Multiplexing  

To lay a foundation: multiplexing is the process of sending the same data to more than one location, and two primary trends are emerging that put it into practice.

Platform migrations 

For many reasons, customers are constantly bouncing from one SIEM platform to another. A lot of complexity comes into play with migrations, but nothing stands out more than data validation. Enormous amounts of content, processes and workflows are all born from the data being ingested, not to mention the library of custom alerts. 

The single most important aspect of changing platforms is ensuring that the data you saw yesterday is the data you see tomorrow. But how can you be sure that the alerts and dashboards are showing the same results? Multiplexing. Many vendors lack the ability to forward logs to multiple locations simultaneously, leaving organizations unable to compare results. With data pipeline management, however, we can send identical logs to several endpoints and, when we are satisfied with the results, simply turn off the old destinations.
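
A bare-bones sketch of that idea might look like the following, with placeholder sender functions standing in for real forwarding endpoints (they are not a real API).

```python
# During a migration, every event is copied to both the legacy and the new SIEM
# so alerts and dashboards can be compared side by side.

def send_to_legacy_siem(event: dict) -> None:
    ...  # e.g., an HTTP event collector or syslog forwarder for the old platform

def send_to_new_siem(event: dict) -> None:
    ...  # the equivalent destination on the new platform

DESTINATIONS = [send_to_legacy_siem, send_to_new_siem]

def multiplex(event: dict) -> None:
    for send in DESTINATIONS:
        send(event)  # identical copy to every destination
```

When the new platform checks out, removing the legacy destination from the list is the entire cutover.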

Cost savings

While data pipeline management is a cost-saving play by itself, the industry is seeing more organizations opt into log division. Subsets of logs are sent to security vendors, while less frequently used logs, such as those kept for compliance, are shipped off to cold storage. This benefits both ingest-based and compute-based pricing models!

For organizations with less stringent retention requirements, logs can also be dropped or reduced to fit your needs. Do you have large Windows logs but only want specific fields? What about those firewall logs that just produce noise? Filter them, drop them or modify them.
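
As a sketch of the reduction side, the snippet below keeps only a handful of Windows fields and drops routine firewall "allow" events outright. The field names and keep-list are assumptions for the example.

```python
# Illustrative reduction rules: trim Windows events to the fields the SOC
# actually queries and drop noisy firewall "allow" traffic entirely.

WINDOWS_KEEP_FIELDS = ("EventID", "TimeCreated", "TargetUserName", "IpAddress", "LogonType")

def reduce_windows_event(event: dict) -> dict:
    return {k: v for k, v in event.items() if k in WINDOWS_KEEP_FIELDS}

def should_drop(event: dict) -> bool:
    return event.get("sourcetype") == "firewall" and event.get("action") == "allow"
```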

Vendor neutrality 

Data pipeline management (DPM) helps free you from vendor lock-in. More often than not, platform migrations are daunting exercises because of the need to re-architect. All the sweat and tears poured into tuning and ingesting must be reconstructed. Think about all the locations logs come from, the servers and endpoints that need to be cleaned up, and the work of deploying brand-new agents. As an alternative, you can tune and deploy DPM once and for all. "Architect once." Quit digging more foxholes and perfect the one you have.

Normalizing  

In essence, normalizing is the process of converting many different log formats into one schema, allowing security vendors to correlate data across entire technology stacks. A single common schema is the most important requirement for detecting threats across an organization. With that, we introduce the term "parser," the engine behind normalization.

Often, parsers can be hard to construct, and nearly every platform will require some level of customization. Not only that, but you are stuck using the vendor's data format, which is usually unique to their product. This may not be a huge deal, but as we enter the world of AI/ML, organizations are starting to see reasons to adopt industry-standard formats such as OCSF (Open Cybersecurity Schema Framework) or ECS (Elastic Common Schema). Storing data in one common schema that is readable by many LLMs will facilitate an organization's journey into AI-based detection.
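
For a sense of what normalization looks like in practice, here is a toy mapping from an invented vendor firewall record onto a few ECS-style field names. Real parsers handle far more fields and edge cases; the input keys here are assumptions for the sketch.

```python
# Toy normalizer: rename vendor-specific keys to a common, ECS-style schema so
# downstream detections can correlate across sources.

ECS_FIELD_MAP = {
    "srcip": "source.ip",
    "dstip": "destination.ip",
    "act": "event.action",
    "devname": "observer.name",
}

def normalize(raw: dict) -> dict:
    return {ECS_FIELD_MAP[k]: v for k, v in raw.items() if k in ECS_FIELD_MAP}

print(normalize({"srcip": "10.1.4.22", "dstip": "8.8.8.8", "act": "deny", "devname": "fw-edge-01"}))
```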

Notable OEMs 

While some of these capabilities may not be novel, data pipeline management is starting to play a much larger role in how organizations architect their data ingestion. SIEM vendors are beginning to include some of these capabilities, such as Splunk Edge Processor or Sumo Logic's pre-processing rules. However, some vendors may stop innovating and offer "just enough" solutions because of the reduction in annual recurring revenue (ARR) these capabilities can cause.

One of the more legacy tools in this space is Logstash, a free and open-source solution. The product is well known for its ability to stream events and route logs to several different endpoints while providing log transformation. However, it lacks robust log reduction capabilities and is known to be resource intensive, taxing the systems that host it.

A few other names worth mentioning are Kinesis Firehose, Vector, and Fluentd & Fluent Bit. Kinesis Firehose handles log delivery and transformation within the AWS ecosystem; while it scales quickly and is a great fit for AWS-first shops, its transformation abilities leave something to be desired. Vector and Fluentd & Fluent Bit are open-source alternatives but are known to lack enterprise-level features. While all these solutions have their place and mesh perfectly within some organizations, two vendors are emerging as the industry leaders.

Cribl: With its strong ability to transform, reduce and multiplex logs, Cribl has paved the way for data pipeline management. Whether you are looking for a platform change, simply want to reduce costs or need to add pre-processing enrichment, Cribl has risen to become the trusted name in this vertical.

Abstract: The most recent player to emerge from the competition is Abstract. With log trimming and transformation abilities akin to Cribl's, it also introduces the concepts of filters and models, which work together to take context and enrichment to a new level before the data ever lands in storage.

Modern data storage and its role in data pipeline management 

Data storage systems today include cloud storage, data lakes and distributed databases, each offering advantages tailored to different types of data. For instance, cloud storage platforms provide scalable, cost-effective solutions for businesses that need to store vast amounts of structured and unstructured data. In addition, data lakes allow organizations to store raw data at scale, supporting real-time analytics and machine learning processing.

Modern storage solutions integrate seamlessly with data pipeline systems, which, as you've now learned, are designed to collect, process and analyze data as it flows through the pipeline. A robust pipeline involves several stages, including data ingestion, transformation and, of course, storage, and storage technologies play an important role: efficient data storage keeps data flowing smoothly between pipeline stages without bottlenecks, supporting a fast and reliable solution.

The alignment between modern data storage and data pipeline management is particularly important when it comes to their shared focus on scalability, flexibility and performance. With growing data volumes and the increasing complexity of data analysis, both storage systems and pipelines need to evolve rapidly. Storage must scale up or down based on usage, while pipelines must adapt to process data in real-time or batch modes. The integration of technologies like cloud-native storage and distributed computing further enhances this synergy between data pipeline and data storage, enabling pipelines to run more efficiently with minimal latency. 

In addition, modern data storage is designed to ensure that data security, compliance and accessibility are maintained seamlessly across the pipeline. Features such as encryption, access controls and audit logging allow organizations to protect sensitive information while providing authorized stakeholders with the data they need to drive insights and innovation.

As data continues to grow and organizations strive for faster, more accurate insights, the alignment between modern data storage and pipeline management becomes more critical. By investing in scalable, secure and efficient storage solutions, your business can ensure your data pipelines and storage are optimized for success. 

Notable storage OEMs

Cloud storage and compliance platforms

General-purpose cloud platforms for hosting, storing and processing security and operational data, such as AWS, Azure and Google Cloud, offer comprehensive security features to protect data against unauthorized access, breaches and ransomware attacks. By leveraging these solutions, businesses can safeguard their most valuable data while maintaining compliance, performance and scalability.

Next-generation SIEM security data platforms

Some OEM solutions, such as CrowdStrike's Next-Gen SIEM and SentinelOne's AI SIEM, offer data compression capabilities that reduce data volume and deliver cost savings while enriching the relevant security data to improve analysis and threat detection. More traditional SIEM and data analytics solutions, such as Splunk, are widely used for both operational intelligence and security threat detection while ingesting large volumes of data.

Security data lakes and analytics platforms 

Other OEMs, such as Snowflake and Databricks, offer architectures built specifically for the cloud that run on top of the major cloud providers, including Amazon Web Services, Microsoft Azure and Google Cloud. This approach is commonly used for data warehousing and analytics, making complex queries and reporting easier for organizations, and it is optimized for large-scale security telemetry storage.

Data storage does come with challenges, most of which involve keeping up with ever-increasing growth in demand. It's important to navigate these challenges by implementing the right strategies, technologies and practices so that your data storage solution is reliable and future-proof.

Final thoughts 

At some point, every organization needs to change to keep up with today's evolving threat landscape. AI- and ML-driven attacks are growing more sophisticated and increasingly difficult to detect, SaaS adoption is generating more logs and noise than ever before, and every new product an organization acquires expands its attack surface. In security, does more data truly mean better visibility? If your organization is grappling with this question, our experts at WWT are here to help. Whether you need support with migrations, SOC transformations or cost optimization, we have the experience to guide you every step of the way.