Public DNS logs reveal hidden CDN costs


With global companies losing nearly $400 billion annually to downtime, analyzing DNS logs is a financial imperative, not just an IT task. You will learn how to extract strategic value from resolver data, architect real-time visualization pipelines using ClickHouse and Grafana, and enforce network automation through rigorous Source of Truth principles.

The stakes are clear: when ISPs rely on public DNS services for 60% to 70% of resolutions, as observed in Bangladesh by APNIC Community Trainer Abu Sufian, they cede control over content location selection to external entities. This blindness prevents operators from knowing whether traffic lands on on-net cache nodes or expensive global peers, directly impacting service delivery costs and latency. Without proprietary instrumentation, the asset owner remains ignorant of the specific IP bindings serving their users.

To counter this, the session at APRICOT 2026 detailed a practical implementation using PowerDNS Recursor to capture query data without disrupting production environments. By feeding these logs into a structured ClickHouse database, engineers can map CDN sources per query and visualize traffic flows in near-real-time. This approach transforms raw log files into actionable intelligence, allowing networks to validate routing policies and optimize peering strategies based on empirical evidence rather than assumption.
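As a concrete illustration of that pipeline, the sketch below loads parsed resolver answers into ClickHouse and tags each answer IP with a CDN label. The table schema, the clickhouse-driver client, and the two example prefixes are assumptions for illustration; the session's actual schema and prefix sources are not reproduced here.

```python
# Hypothetical sketch: load parsed resolver answers into ClickHouse for CDN mapping.
# Table name, columns, and the prefix map are illustrative, not the session's schema.
from datetime import datetime
import ipaddress

from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS dns_answers (
        ts DateTime,
        qname String,
        answer_ip String,
        cdn LowCardinality(String)
    ) ENGINE = MergeTree ORDER BY ts
""")

# Example prefix-to-CDN map; a real deployment would load published CDN ranges.
CDN_PREFIXES = {
    "104.16.0.0/13": "cloudflare",
    "23.32.0.0/11": "akamai",
}

def tag_cdn(ip: str) -> str:
    """Return a CDN label for an answer IP, or 'unknown' if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    for prefix, name in CDN_PREFIXES.items():
        if addr in ipaddress.ip_network(prefix):
            return name
    return "unknown"

rows = [(datetime.utcnow(), "example.com", "104.16.1.1", tag_cdn("104.16.1.1"))]
client.execute("INSERT INTO dns_answers (ts, qname, answer_ip, cdn) VALUES", rows)
```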

The Strategic Role of DNS Logs in CDN Efficiency Analysis

How Public DNS Resolvers Obscure CDN Cache Node Visibility

Data from Abu Sufian, Head of Technology at ADN Telecom, shows that 60% to 70% of visible DNS resolution flows through public services like Google or Cloudflare. This delegation shifts optimal content location selection to third parties rather than the network operator. Consequently, ISPs lose visibility into which specific cache nodes serve their subscribers. The mechanism relies on Global Server Load Balancing (GSLB), where the resolver's IP dictates the returned edge server address. When external resolvers handle queries, the ISP cannot distinguish between on-net and global traffic sources. According to the APNIC Community Trainer's presentation, public DNS adoption has grown at an annual rate of 27% since a 2011 baseline. This trend expands the operational blind spot regarding latency and transit costs.
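GSLB divergence is easy to observe directly: resolve the same hostname through a public resolver and through a local one and compare the returned edges. This is a minimal sketch using dnspython; the second resolver address and the test hostname are placeholders.

```python
# Illustrative check of GSLB divergence: the same name resolved through a public
# resolver and a local one can return different CDN edges.
import dns.resolver  # pip install dnspython

def answers_via(resolver_ip: str, qname: str) -> set:
    """Resolve qname's A records through a specific resolver."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [resolver_ip]
    return {rr.address for rr in r.resolve(qname, "A")}

public_edges = answers_via("8.8.8.8", "www.example-cdn-site.com")   # Google Public DNS
local_edges = answers_via("192.0.2.53", "www.example-cdn-site.com") # hypothetical ISP resolver

if public_edges != local_edges:
    print("GSLB steered differently:", public_edges, "vs", local_edges)
```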

Operators face a disconnect between perceived delivery paths and actual routing when DNS log analysis remains superficial. Without deep inspection, validating whether users receive content from the nearest geographical point becomes impossible. HubSpot improved onboarding performance by correlating error codes and cache efficiency across multiple CDNs, proving that granular log access drives business metrics. Without this data, CDN efficiency tuning stays speculative. Outsourcing resolution gains speed but sacrifices control over traffic engineering outcomes. Restoring the feedback loop required for precise routing adjustments demands local DNS termination.

Extracting On-Net versus Global Traffic Patterns from DNS Logs

According to data from Abu Sufian, Head of Technology at ADN Telecom, granular log parsing reveals specific cache nodes that would otherwise remain hidden from view. Deploying PowerDNS Recursor captures query responses and maps returned IP addresses against known CDN ranges to achieve this visibility. Researchers analyzed DNS request behaviors using logs recorded by CERNET, providing a CDN service provider's perspective on network usage patterns. Local delivery distinguishes itself from distant global hops based on the resolver's geographic proximity logic. A global B2B SaaS platform used CDN log analysis to correlate error codes and cache efficiency, cutting first-time user latency by 20% during onboarding. Relying solely on DNS data, however, ignores the transport path validation that BGP provides. Integrating Border Gateway Protocol data allows operators to cross-reference the origin AS of the serving IP with their own peering database; traffic claimed as on-net might in fact traverse expensive transit links due to suboptimal routing policies. The constraint is the increased storage required to correlate high-volume DNS streams with BGP updates. Matching application-layer resolution against network-layer topology gives operators definitive proof of caching efficiency.
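A minimal sketch of that classification logic follows. The prefix list, ASN set, and origin-AS lookup are hypothetical stand-ins; in production the lookup would come from a local BGP RIB or BMP feed, and the ASN set from the operator's peering database.

```python
# Sketch of the on-net vs. global classification described above.
import ipaddress

ON_NET_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder cache-node range
ON_NET_ASNS = {64500, 64501}                                # placeholder local/peer ASNs

def origin_asn(ip: str) -> int:
    """Stub: in practice, resolve the serving IP's origin AS from a BGP table dump."""
    return 64500  # placeholder

def classify(answer_ip: str) -> str:
    addr = ipaddress.ip_address(answer_ip)
    if any(addr in net for net in ON_NET_PREFIXES):
        # Cross-check the network layer: an "on-net" IP announced by a foreign
        # AS may still ride expensive transit.
        return "on-net" if origin_asn(answer_ip) in ON_NET_ASNS else "suspect-routing"
    return "global"

print(classify("203.0.113.10"))  # -> on-net
```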

The Business Impact of Third-Party Content Location Selection

According to data from Abu Sufian, Head of Technology at ADN Telecom, asset owners no longer control optimal content location selection once third parties assume command. Externalization forces reliance on public DNS heuristics that often ignore local network topology in favor of global latency metrics. When a local resolver is bypassed, the returned CDN edge address reflects the resolver's position, not the subscriber's physical location. Delegating this logic removes the operator's ability to steer traffic toward cheaper, on-net caches. Delays increase while potential savings from local peering vanish, creating a measurable cost. Standard dashboards hide the upstream resolver's influence, so operators must parse raw logs to reconstruct these decisions. Without this reconstruction, billing disputes over inter-zone traffic remain unresolved. If operators fail to audit these third-party selections, revenue optimization falls entirely to external algorithms.

Architecture of Real-Time Traffic Visualization with Grafana and ClickHouse

Defining the Grafana-Akvorado-LibreNMS Data Pipeline Architecture

As reported in the Skymedia case study, LibreNMS provides device health counters but lacks traffic breakdowns by source. This gap forces operators to deploy Akvorado for IPFIX and NetFlow flow-level aggregations. The architectural tension arises because Akvorado lacks native Role-Based Access Control, creating privacy risks for multi-tenant environments. Skymedia resolved this by engineering a custom Grafana layer that unifies both data streams under strict RBAC policies.

According to the session's key data points, the LibreNMS feed operates on a five-minute cycle to isolate query load. This interval introduces a slight lag but prevents dashboard interactions from overwhelming the management server. Conversely, the Akvorado integration relies on direct SQL queries against ClickHouse, exposing the system to potential data loss if the feed disconnects.

| Component | Function | Latency Profile |
|---|---|---|
| LibreNMS | Device health counters | Five-minute cycle |
| Akvorado | Flow aggregation | Near real-time |
| Grafana | Unified RBAC view | Variable |

A persistent limitation remains the asynchronous nature of these SQL queries, which can stall rendering if the underlying ClickHouse cluster experiences high write latency. Operators must tune retention policies carefully, as storing raw flow data indefinitely incurs prohibitive storage costs compared to aggregated metrics. The final architecture sacrifices some real-time fidelity for the security guarantees required by enterprise customers.
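For orientation, the query below shows the kind of direct SQL such a Grafana layer might issue against ClickHouse: a five-minute top-talkers aggregation via clickhouse-driver. The `flows` table and column names approximate Akvorado's schema and may differ by version; treat this as a sketch, not Skymedia's implementation.

```python
# Hedged example of a direct flow-aggregation query against ClickHouse.
# Host, table, and column names are assumptions.
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # placeholder host

TOP_TALKERS = """
    SELECT SrcAddr, sum(Bytes) AS total_bytes
    FROM flows
    WHERE TimeReceived >= now() - INTERVAL 5 MINUTE
    GROUP BY SrcAddr
    ORDER BY total_bytes DESC
    LIMIT 10
"""

for src, total in client.execute(TOP_TALKERS):
    print(src, total)
```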

Implementing RBAC in Grafana to Secure Customer Traffic Views

Per the Skymedia case study, direct Akvorado use failed customer requirements due to missing Role-Based Access Control and the risk of competitor exposure. Munkhtulga Bayarkhuu designed a unified Grafana interface integrating LibreNMS health counters with ClickHouse flow aggregations to enforce strict view isolation. The mechanism routes InfluxQL queries for device metrics while executing direct SQL statements against flow data, creating distinct failure domains for each source. This architecture prevents cross-customer data leakage that raw flow exporters cannot stop natively. However, the direct SQL dependency introduces fragility where dashboard outages correlate strictly with database availability.

| Function | Backend | Cadence | Risk |
|---|---|---|---|
| Device health | LibreNMS | Five-minute | Low latency impact |
| Flow aggregation | ClickHouse | Real-time | High query load |
| Access control | Grafana RBAC | Instant | Single point of failure |

Operators must tune SQL execution paths to avoid resource contention when multiple tenants refresh views simultaneously. Based on the Skymedia case study, cloning the database with reduced retention mitigated storage costs while preserving query performance for active incidents. The cost is increased architectural complexity requiring manual synchronization logic between primary and clone systems.

When isolation fails, direct SQL queries against flow databases expose sensitive source-destination pairs to unauthorized tenants. High concurrent query loads can also overwhelm the storage engine, leading to real-time data loss during peak visualization periods and compounding the single point of failure noted above.

Operators must balance granular visibility against the computational cost of asynchronous query execution. The constraint is increased storage complexity versus the certainty of tenant data privacy.
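One way to picture the isolation layer is as server-side query templating: the tenant's allowed prefixes are injected into a parameterized ClickHouse query, so no unfiltered SQL ever reaches the backend. The tenant map, schema, and `isIPAddressInRange` filter below are illustrative assumptions, not the production design.

```python
# Minimal sketch of the view isolation a custom Grafana layer can enforce.
# Tenant table, flow schema, and column types are illustrative.
from clickhouse_driver import Client

TENANT_PREFIXES = {
    "customer-a": ["198.51.100.0/24"],
    "customer-b": ["203.0.113.0/24"],
}

def tenant_traffic(client: Client, tenant: str):
    prefixes = TENANT_PREFIXES.get(tenant)
    if not prefixes:
        raise PermissionError(f"no view defined for {tenant}")
    # Parameterized query: the tenant never supplies raw SQL.
    return client.execute(
        """
        SELECT DstAddr, sum(Bytes) AS total_bytes
        FROM flows
        WHERE TimeReceived >= now() - INTERVAL 1 HOUR
          AND isIPAddressInRange(toString(SrcAddr), %(prefix)s)
        GROUP BY DstAddr
        ORDER BY total_bytes DESC
        """,
        {"prefix": prefixes[0]},
    )
```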

Operationalizing Network Automation via Source of Truth Principles

Defining Source of Truth with NetBox and Zabbix Integration

[Figure: Bar chart showing 24% lower infrastructure spend and 42% higher staff productivity with managed services, alongside an $18.72 billion market projection and 19% faster DNS response times.]

As reported in the APRICOT 2026 session materials, Ulsbold Enkhtaivan defines NetBox as the authoritative database while Zabbix executes monitoring logic. This architecture separates intended state from observed reality, forcing operators to reconcile discrepancies before automation proceeds. The mechanism relies on Python scripts acting as glue, fetching device states from NetBox and pushing configurations via Paramiko. According to the same session materials, this integration tracks deployment status from planned to active without manual CLI intervention. However, the system requires manual mediation to prevent critical routing plane failures during the fetch-change-test-rewrite cycle. Operators gain a documented lifecycle for device identity but sacrifice fully autonomous correction loops. The implication is a shift in operational risk: configuration drift disappears, yet reliance on human validation persists for high-impact changes.

| Component | Role | Data Source |
|---|---|---|
| NetBox | Authoritative state | Static definition |
| Zabbix | Monitoring logic | Real-time polling |
| Python | Integration glue | Custom script |

Blind trust in automated reconciliation remains dangerous when upstream dependencies fluctuate.
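A minimal sketch of that glue pattern, assuming pynetbox for the NetBox API and Paramiko for SSH, follows. The URL, token, credentials, and the command pushed are placeholders, not the presenter's actual scripts.

```python
# Sketch of the NetBox-to-device glue: read intended state, push over SSH.
import paramiko
import pynetbox  # pip install pynetbox

nb = pynetbox.api("https://netbox.example.net", token="REDACTED")

for device in nb.dcim.devices.filter(status="planned"):
    # Assumes each device has a primary IP assigned in NetBox.
    mgmt_ip = str(device.primary_ip.address).split("/")[0]  # strip prefix length

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(mgmt_ip, username="automation", key_filename="/root/.ssh/id_ed25519")

    # Placeholder command; real workflows run the fetch-change-test-rewrite cycle.
    stdin, stdout, stderr = ssh.exec_command("show version")
    print(device.name, stdout.read().decode())
    ssh.close()
```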

Executing Zero-Touch Deployment via Email-Confirmed Workflows

Per the APRICOT 2026 session materials, staff use email confirmations to trigger Paramiko SSH sessions for device updates. This zero-touch deployment mechanism replaces manual CLI access with a structured fetch-change-test-rewrite cycle driven by Python logic. Operators define intended states in NetBox, while Zabbix monitors real-time health, creating a closed loop where configuration changes only proceed after human validation via an emailed link. The system tracks device identity and availability throughout the lifecycle, ensuring documentation matches the active network state. However, Ulsbold Enkhtaivan noted the workflow requires manual mediation to prevent critical routing plane failures during the automated rewrite phase. This limitation exists because fully autonomous path computation remains risky without deeper topology awareness in the orchestration layer. Consequently, operators gain a verified audit trail and reduced on-site wage costs, yet they must accept that the "zero-touch" label applies to execution, not decision-making. The takeaway: safety relies on keeping humans in the confirmation loop rather than achieving full autonomy (a sketch of this gate follows the table below). Organizations replacing capital-intensive ownership models with such managed operational workflows record 24% lower infrastructure spend, and industry analysis indicates a 42% increase in staff productivity when shifting from reactive manual intervention to these mediated automation cycles.

| Component | Role | Constraint |
|---|---|---|
| NetBox | Source of Truth | Requires accurate initial data entry |
| Zabbix | State monitor | Alerts only on observed deviations |
| Paramiko | Change agent | Needs pre-shared keys for SSH |
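The confirmation gate itself can be pictured as a one-time token staged with each change: the change only executes once a human follows the emailed link. This stdlib-only sketch assumes hypothetical addresses and a confirm URL; it is not the presented codebase.

```python
# Sketch of the email-confirmation gate for staged configuration changes.
import secrets
import smtplib
from email.message import EmailMessage

PENDING = {}  # token -> staged device change

def stage_change(device: str, diff: str) -> None:
    """Stage a change and email a one-time approval link to the on-call reviewer."""
    token = secrets.token_urlsafe(16)
    PENDING[token] = device

    msg = EmailMessage()
    msg["Subject"] = f"Confirm config change on {device}"
    msg["From"] = "noc-automation@example.net"
    msg["To"] = "oncall@example.net"
    msg.set_content(f"{diff}\n\nApprove: https://automation.example.net/confirm/{token}")
    with smtplib.SMTP("mail.example.net") as smtp:
        smtp.send_message(msg)

def on_confirm(token: str) -> None:
    """Called when the reviewer follows the link; only then does the push run."""
    device = PENDING.pop(token, None)
    if device is None:
        raise KeyError("unknown or already-used token")
    # Only now does the Paramiko push (see earlier sketch) execute against `device`.
    print(f"applying staged change to {device}")
```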

Validating State Transitions from Planned to Active Status

Based on the APRICOT 2026 session materials, NetBox and Zabbix must synchronize to move devices from planned to active status accurately. This mechanism relies on Python glue code executing a fetch-change-test-rewrite cycle, ensuring the Source of Truth matches physical reality before traffic flows. Operators gain a documented lifecycle for every interface, preventing configuration drift that plagues manual updates. However, Ulsbold Enkhtaivan noted the system requires manual mediation to prevent critical routing plane failures during complex transitions. Relying solely on automated state changes without human verification risks propagating errors across the mobile backhaul network instantly.

| Checkpoint | Validation Target | Tool Source |
|---|---|---|
| Identity match | Serial number vs. NetBox record | NetBox |
| Reachability | ICMP/SSH response | Zabbix |
| Config integrity | Template render check | Jinja |
| Integration logic | GitHub connector | GitHub |

Should operators use open-source tools for network monitoring? The operational cost shifts from licensing fees to the engineering time required to sustain the custom logic binding Zabbix alerts to NetBox records. Failure to maintain this bond results in silent divergence between the database and the live network edge.
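The transition gate reduces to a simple invariant: NetBox may only claim "active" for hardware that demonstrably answers. In the presented design Zabbix supplies the reachability signal; the sketch below substitutes a raw SSH-port probe and flips the status via pynetbox, with all endpoints hypothetical.

```python
# Sketch of the planned-to-active gate: probe before promoting the record.
import socket
import pynetbox  # pip install pynetbox

nb = pynetbox.api("https://netbox.example.net", token="REDACTED")

def ssh_reachable(ip: str, timeout: float = 3.0) -> bool:
    """Stand-in for the Zabbix reachability check: does the device answer on SSH?"""
    try:
        with socket.create_connection((ip, 22), timeout=timeout):
            return True
    except OSError:
        return False

for device in nb.dcim.devices.filter(status="planned"):
    ip = str(device.primary_ip.address).split("/")[0]
    if ssh_reachable(ip):
        device.status = "active"
        device.save()  # NetBox now matches observed reality
    else:
        print(f"{device.name} unreachable; leaving status as planned")
```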

Comparative Analysis of Open-Source Flow Analysis Tools

Defining LibreNMS Device Health Versus Akvorado Flow Aggregation


LibreNMS polls device counters on a fixed 5-minute cycle, while Akvorado processes IPFIX streams for immediate traffic breakdowns. This temporal divergence creates distinct operational realities for network engineers managing enterprise visibility. LibreNMS excels at tracking device health metrics like CPU load and interface errors through SNMP, providing a stable historical baseline. Conversely, Akvorado ingests high-volume flow data to reveal source-destination conversations that simple counters obscure. The mechanism relies on pushing raw flow records into ClickHouse for aggregation, whereas LibreNMS stores rolled-up statistics in a time-series database.

| Feature | LibreNMS | Akvorado |
|---|---|---|
| Data source | SNMP counters | NetFlow/IPFIX |
| Resolution | 5-minute intervals | Near real-time |
| Primary use | Hardware status | Traffic forensics |

Skymedia deployed this dual-system approach because Akvorado lacks native Role-Based Access Control, creating privacy risks for multi-tenant environments. Munkhtulga Bayarkhuu noted that direct customer access to flat flow data exposes competitor information without strict view separation. The limitation is that synchronizing these disparate data sources requires complex SQL tuning to prevent query storms on the storage backend. Operators must accept that unifying these feeds introduces a single point of failure if the aggregation layer stalls.

Meanwhile, Skymedia resolved Akvorado's RBAC gaps by architecting a Grafana front-end that enforces access policies absent from the native flow tool. Munkhtulga Bayarkhuu designed this unified interface to merge LibreNMS device health counters with ClickHouse flow aggregations, shielding enterprise customers from competitor data exposure. The mechanism queries InfluxQL for SNMP metrics while executing direct SQL against flow storage, creating a single pane of glass for source-destination traffic views. However, direct database coupling introduces fragility; asynchronous dashboard queries risk starving the underlying ClickHouse instance during peak visualization demand. Operators gain granular visibility without exposing raw infrastructure details, yet they inherit a complex dependency chain where database tuning directly impacts dashboard responsiveness.

| Dimension | LibreNMS Source | Akvorado Source | Unified Grafana |
|---|---|---|---|
| Data latency | 5-minute cycle | Near real-time | Mixed fidelity |
| Access control | Native RBAC | None available | Enforced via proxy |
| Query load | Isolated time-series | Direct SQL risk | Asynchronous spike |

Direct SQL exposure means a single runaway report can degrade visibility for all tenants sharing the cluster. This architectural tension forces a choice between real-time accuracy and system stability under load. Database cloning strategies mitigate retention costs but double the storage footprint required for historical analysis. Network teams must weigh the operational convenience of unified views against the engineering overhead of maintaining custom glue logic. True durability requires isolating customer-facing query paths from the primary ingestion engine entirely.

Privacy Risks and RBAC Gaps in Direct Akvorado Deployment

Direct Akvorado deployment exposes raw flow tables to unauthorized viewers because the platform lacks native Role-Based Access Control. This architectural absence prevents operators from restricting customer views to specific IP prefixes, creating immediate privacy violations. Competitors monitoring shared dashboard instances could infer traffic volumes and peer relationships without explicit authorization. Munkhtulga Bayarkhuu identified this gap as the primary barrier for Skymedia when offering self-service analytics to enterprise clients in Mongolia. The solution required wrapping LibreNMS and Akvorado outputs within a Grafana layer that enforces strict access policies.

| Capability | Native Akvorado | Grafana Wrapper |
|---|---|---|
| Access control | None | Full RBAC |
| Data isolation | Global view | Per-tenant filter |
| Query safety | Direct SQL | Controlled API |

Operators attempting direct exposure risk leaking sensitive business intelligence through unfiltered ClickHouse queries. InterLIR recommends avoiding direct user access to flow collectors until multi-tenancy features mature. The trade-off involves increased complexity in the visualization stack to compensate for backend deficiencies. Network teams must engineer custom middleware to sanitize data before it reaches the browser interface. Failure to isolate data streams allows a single compromised credential to reveal total network topology. This vulnerability forces a choice between useful visibility and fundamental security hygiene. Most organizations cannot justify exposing unaggregated flow records to external parties. The cost of building a secure proxy often exceeds the initial savings of using open-source tools alone.

About

Vladislava Shadrina, Customer Account Manager at InterLIR, brings a unique client-centric perspective to the critical discussion on DNS logs. While her background spans architecture and design, her daily role involves managing complex client relationships within the IPv4 address marketplace, where network stability is paramount. At InterLIR, a Berlin-based firm dedicated to solving network availability problems, Vladislava understands that efficient resource allocation relies heavily on accurate data. The insights from APRICOT 2026 regarding DNS log analysis directly correlate to her work ensuring clients maintain clean BGP routes and optimal IP reputation. By connecting operational data like DNS metrics to tangible business outcomes, she helps clients avoid the massive financial losses associated with network downtime. Her expertise bridges the gap between technical instrumentation and strategic resource management, proving that understanding network behavior is essential for any organization relying on reliable IP infrastructure.

Conclusion

Scaling DNS log analysis reveals that raw visibility becomes a liability without rigorous data isolation. While public resolver adoption accelerates, the operational reality shifts from simple collection to managing the blast radius of exposed flow tables. The market's rapid expansion toward managed services confirms that internal teams can no longer sustain the engineering burden of building custom security proxies for every open-source tool. True scalability demands accepting that unfiltered access to ClickHouse backends is an unacceptable risk, regardless of the latency gains or cost savings initially promised by direct deployment.

Organizations must commit to a hybrid architecture within the next two quarters, mandating a dedicated visualization layer like Grafana to enforce strict RBAC before any data reaches user dashboards. Do not attempt to scale direct Akvorado exposure; the privacy debt incurred by leaking topology data will far exceed the infrastructure spend of a proper middleware wrapper. Start by auditing your current flow collector permissions this week to identify any accounts with direct SQL access to raw logs, then immediately revoke them in favor of aggregated API endpoints. This single action prevents catastrophic intelligence leaks while you engineer the necessary sanitization pipeline. The era of trusting network boundaries alone is over; data-level enforcement is now the only viable path forward for sustainable network operations.

Frequently Asked Questions

How much DNS traffic do ISPs lose to public resolvers like Google?
Public resolvers handle 60% to 70% of visible DNS queries for many ISPs. This delegation prevents operators from seeing which cache nodes serve their users, hiding critical data about on-net versus global traffic patterns.
What latency reduction can granular log analysis achieve during user onboarding?
Correlating error codes with cache efficiency drives a 20% reduction in first-time user latency. This improvement proves that accessing granular log data directly enhances business metrics rather than leaving CDN tuning as a speculative exercise.
At what annual rate has public DNS adoption grown since 2011?
Public DNS adoption has expanded at an annual rate of 27% since the 2011 baseline. This rapid growth increases the operational blind spot for ISPs regarding specific latency issues and rising transit costs across their networks.
Why do operators need local DNS termination instead of third-party resolution?
Third-party resolution sacrifices control over traffic engineering outcomes while gaining speed for the network. Without local termination, asset owners cannot distinguish between on-net and global traffic sources or validate routing policies effectively.
Which open source tools build real-time DNS visualization pipelines effectively?
PowerDNS Recursor, ClickHouse, and Grafana form a cost-effective stack for visualizing DNS queries. This combination allows engineers to map CDN sources per query and classify placement without disrupting production environments.
Vladislava Shadrina
Customer Account Manager