The discussion provides a strategic reframing of log pipelines, moving their perceived value far beyond the common "reduce the SIEM bill" narrative. Balazs Scheidler argues that modern, observable pipelines are critical infrastructure for data quality, classification, normalization, and management, which legacy tools and SIEMs are unequipped or disincentivized to handle.
The central thesis is that the industry's failure to solve basic data quality, parsing, and schema issues (illustrated by a story of corrupted Palo Alto CEF logs being ingested for years) has rendered many detections useless. Pipelines are the only component incentivized to fix this "data quality gap."
Furthermore, by applying modern, declarative management principles ("cattle, not pets") to pipelines, organizations can finally automate the difficult feedback loop required to make an "output-driven SIEM" a practical reality, rather than an academic exercise.
🗒️ Detailed Discussion Summary
1. The Centralization vs. Access Debate
The conversation began by challenging the traditional "centralize everything" model of logging, which Balazs helped create with syslog-ng in the 1990s for compliance and correlation.
Refuting "Leave at Source": Balazs immediately dismissed the modern idea of leaving data on original sources and querying it via "AI agents" or federated search alone. He argued this is operationally naive, as data sources are not designed for retention and critical data will be lost. Data must be moved off the source.
A Hybrid Approach: The proposed solution is not a single, central repository. Instead, organizations should differentiate data. High-value security data can go to the SIEM, while high-volume, low-value data (e.g., temporary debug logs) can be routed to cheaper, local storage.
The Requirement: This hybrid model only works if it is supported by robust federated search ("access") and the ability to easily "rehydrate" data from cold storage back into the hot path when needed for an investigation.
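As an illustration of that split, here is a minimal routing sketch at the pipeline layer, assuming a hypothetical classify-then-route step; the category names, destination tiers, and the rehydration stub are illustrative, not anything specified in the episode.

```python
# Hypothetical value-based routing: data is always moved off the source, but
# only high-value security telemetry goes to the SIEM; the rest lands in
# cheaper storage that can be rehydrated during an investigation.
from dataclasses import dataclass

@dataclass
class LogEvent:
    source: str      # e.g. "pa-fw-01"
    category: str    # e.g. "firewall", "auth", "debug"
    raw: bytes

HIGH_VALUE = {"firewall", "auth", "edr"}   # illustrative policy, not a standard

def route(event: LogEvent) -> str:
    """Pick a destination tier for an event."""
    return "siem" if event.category in HIGH_VALUE else "cold-object-storage"

def rehydrate(query: str) -> list[LogEvent]:
    """Placeholder for pulling cold-storage data back into the hot path
    (federated search / rehydration); the mechanics depend on the backend."""
    raise NotImplementedError
```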
2. The Real Problem: Pipeline Management
Balazs argued that the biggest failure of legacy log management was not the technology of moving logs, but the static and manual management paradigm.
"Pets vs. Cattle": Legacy pipelines are "pets." They require manual, static configuration for every source and destination (e.g., "open port 514 for this source, 515 for that source," "point this Palo Alto device to this IP"). This is brittle, error-prone, and scales poorly.
Modern Management: A modern pipeline must be managed like a "fleet of compute" (i.e., "cattle"). It requires declarative automation, similar to Kubernetes, where the operator defines the desired state ("all firewall logs go to the SIEM") and the system automates the configuration.
Enabling "Output-Driven SIEM": This automation is the prerequisite for an "output-driven SIEM." Anton noted his own "output-driven SIEM" concept is often impractical because the feedback loop (from "I need this detection" to "I am collecting the right data") is too manual. Declarative pipelines are the missing technology to automate that loop.
3. The Data Governance Gap and the "Eureka Moment"
The discussion identified the lack of data governance for security telemetry as a primary obstacle: governance practices exist for business data, but not for logs.
The Classification Crisis: The core missing piece is automated classification. Organizations cannot translate high-level policy ("retain security logs for one year") into low-level action because they cannot reliably identify "security logs" from the "bytes on the wire" without massive, brittle regex configurations (a classification sketch follows at the end of this section).
The Pain Point: This manifests in the classic, impossible request: "Give me a list of all security-relevant Windows event IDs." No one can definitively do this or maintain it.
The SIEM Incentive Problem: SIEMs are not incentivized to solve this. Their business model ("more data = more money") often makes them apathetic to data quality.
Data Quality "War Story": Balazs provided a critical example of a major firewall vendor (Palo Alto) whose official documentation for producing CEF (Common Event Format) logs results in corrupted data. The logs were non-compliant (multi-line, duplicate keys) and truncated at 2048 bytes with improper syslog framing. This "bogus, unparseable" data was ingested into a customer's SIEM for years, rendering all detections based on it completely useless.
Anton's "Eureka Moment": This anecdote crystallized the main point. The industry narrative for pipelines has been "use Vendor C (Cribl) to cut the bill for Vendor S (Splunk)." This is purely tactical. The strategic value of a pipeline is that it is the only component in the architecture that is incentivized and positioned to fix data quality.
4. Pipelines as the Hub for Normalization and Enrichment
With quality and classification solved, the pipeline becomes the logical place to handle normalization and enrichment.
Solving the "Schema Wars": The industry is flooded with competing schemas (UDM, ECS, CIM, OCSF, CEF, LEAF). Mapping from one schema to another (e.g., OCSF-to-UDM) is fragile. The correct approach is to use the pipeline's classification of the raw source data to map directly to any required target schema on demand. This provides true schema-agnostic access.
The Case for Pipeline-Level Enrichment: Anton pushed back, suggesting that enrichment is the SIEM's job. Balazs presented three counter-arguments for moving it "left" into the pipeline:
Scale: Running enrichment queries against terabytes of data per day in the SIEM is often not feasible.
Static Context: It is far easier to add static context (e.g., "this server is in rack 4, unit 15" from a CMDB) at the pipeline level.
Ephemeral Context: Critical real-time context (like the true source IP address) is lost by the time the log traverses NATs and proxies to the central SIEM. This must be captured at the pipeline's ingestion point.
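A minimal sketch of those three points at ingest time: static context joined from a CMDB export at the edge, and the ephemeral peer address captured before NATs and proxies erase it. The CMDB table, field names, and addresses are invented for illustration.

```python
# Hypothetical pipeline-level enrichment: add static and ephemeral context
# where it is cheap and still available, instead of joining terabytes of
# data inside the SIEM later.

CMDB = {  # static context, e.g. exported from a CMDB (illustrative values)
    "web-01": {"rack": "4", "unit": "15", "owner": "payments-team"},
}

def enrich(event: dict, peer_ip: str) -> dict:
    """Attach context the SIEM could not reconstruct later.
    peer_ip is the transport-layer source address seen by the collector --
    the "true" sender, which is lost once the log crosses NATs and proxies."""
    enriched = dict(event)
    enriched["observed_peer_ip"] = peer_ip
    enriched.update(CMDB.get(event.get("host", ""), {}))
    return enriched

print(enrich({"host": "web-01", "msg": "login failed"}, peer_ip="192.168.12.34"))
# -> rack/unit/owner from the CMDB plus the peer address seen at ingest
```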
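And a sketch of the schema point from earlier in this section: once the pipeline has classified and parsed the raw record, it can project it into whichever target schema a consumer asks for, rather than converting schema-to-schema. The "udm-like" and "ocsf-like" field names are simplified placeholders, not the real UDM or OCSF definitions.

```python
# Hypothetical schema-agnostic projection from the pipeline's own parsed,
# classified representation into a requested target schema on demand.

parsed = {  # produced by pipeline parsing + classification (illustrative)
    "class": "network.firewall",
    "src_ip": "10.0.0.5",
    "dst_ip": "203.0.113.7",
    "action": "deny",
}

TARGET_SCHEMAS = {  # simplified placeholder mappings, not real schema specs
    "udm-like": {"src_ip": "principal.ip", "dst_ip": "target.ip",
                 "action": "security_result.action"},
    "ocsf-like": {"src_ip": "src_endpoint.ip", "dst_ip": "dst_endpoint.ip",
                  "action": "disposition"},
}

def project(parsed_event: dict, schema: str) -> dict:
    """Map the classified raw record directly to the requested schema."""
    mapping = TARGET_SCHEMAS[schema]
    return {mapping[k]: v for k, v in parsed_event.items() if k in mapping}

print(project(parsed, "udm-like"))
# -> {'principal.ip': '10.0.0.5', 'target.ip': '203.0.113.7',
#     'security_result.action': 'deny'}
```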
5. Key Takeaways and Advice
Primary Value: The strategic value of modern pipelines is not cost-saving (which is a byproduct) but ensuring data quality, classification, and normalization.
Management: Pipelines must be managed via declarative automation ("cattle"), not manual configuration ("pets").
Balazs's Advice: The single most important thing security teams can do is to "please look at what you ingest." Observing the actual data, rather than assuming it is correct, is the first step.
⏳ Timeline of Discussion
Introduction and banter about the appeal of "pipelines" versus "not pipelines."
Anton frames the "Uber question": Are we moving from centralizing data to simply providing access to data?
Balazs counters the "leave data at source" narrative, highlighting the risk of data loss, and proposes a hybrid model (some centralized, some local).
The conversation shifts to the real problem with legacy pipelines: static, manual management.
Tim introduces the "cattle, not pets" analogy for modern pipeline management.
Balazs critiques the current manual state (e.g., configuring syslog ports by hand).
Anton raises the practical challenge of managing varied retention and streaming requirements.
Balazs identifies the root cause: the "data governance gap" for security data.
The core technical solution is proposed: automated classification of data at the source.
Anton and Tim share the common pain point of identifying "security logs" (e.g., the Windows event log problem).
Balazs discusses the misaligned incentives of SIEM vendors regarding data quality.
A "war story" is shared: how Palo Alto devices can send corrupted, truncated, and unparseable CEF logs, breaking detections.
This triggers Anton's "Eureka moment": The strategic value of pipelines is data quality assurance, not just cost-cutting (the "C vs. S" narrative).
Balazs confirms that cost-cutting simply "opened our eyes" to the deeper quality problem.
The discussion links automated pipelines to enabling the "output-driven SIEM" concept by solving its manual feedback loop.
Balazs explains how pipelines solve the "schema wars" (UDM, OCSF, etc.) by using classification to map from the raw source, not from schema-to-schema.
Anton challenges the role of enrichment, asking if it should be in the SIEM.
Balazs provides a three-point case for pipeline-level enrichment (scale, static context, and ephemeral/real-time context like source IP).
The hosts conclude with final questions, and Balazs gives his closing advice: "Look at what you ingest."