Cloudflare edge shift: Why 2MB cache matters

Cloudflare's new Gen 13 servers cut per-core L3 cache to just 2MB, a sixth of the previous generation's allocation (see Cloudflare's write-up, "A tour inside Cloudflare's latest generation servers"). This hardware reality forces a fundamental architectural pivot: high-density edge infrastructure can no longer rely on massive caches to mask software inefficiencies. The era of cache-heavy reliance is over, replaced by a core-dense model where performance scales strictly through software optimization and thread isolation.

While the new silicon offers up to 192 cores and improved instructions-per-cycle via the Zen 5 architecture, the drastic reduction in shared L3 cache creates severe contention for legacy stacks like NGINX and LuaJIT. Readers will learn how Cloudflare resolved these bottlenecks by decoupling performance from cache locality, enabling the network to serve over 41 million websites efficiently. We examine the strategic shift toward workload isolation using PQOS and detailed CPU profiling to maintain strict SLAs despite the leaner memory hierarchy. Finally, we explore how eliminating dependencies on large caches allows modern edge networks to fully exploit the 384 threads and power efficiency of next-generation silicon.

The Strategic Shift from Cache-Heavy to Core-Dense Edge Infrastructure

Defining the Gen 13 Shift from 3D V-Cache to Core Density

According to Cloudflare blog data, Gen 13 servers use AMD EPYC™ Turin processors with 192 cores, sacrificing per-core cache for density. This architectural pivot marks a move away from the 3D V-Cache reliance of the 12th-generation fleet toward raw parallelism. The FL2 request handling layer, a Rust-based rewrite, enables this transition by removing dependencies on large cache pools that previously masked memory latency. An L3 cache miss occurs when requested data is absent from the shared last-level cache, forcing a slow fetch from main memory that stalls CPU cycles. The shift creates a specific tension: AMD EPYC™ Turin allocates just 2MB of L3 cache per core, one-sixth of the 12MB found in prior AMD EPYC™ Genoa-X designs.

Applying AMD EPYC Turin Specs for Edge Compute Scaling

Cloudflare blog data shows Gen 13 servers deploy 192-core AMD EPYC™ Turin processors to maximize parallel request handling. This configuration directly addresses how core count affects edge compute by enabling massive thread concurrency without proportional latency increases. The architecture supports 384 threads via Simultaneous Multithreading, allowing the FL2 layer to distribute load efficiently across dense physical resources. Operators should consider upgrading server hardware when cache misses, rather than raw throughput limits, dominate cycle counts. Memory bandwidth becomes the primary constraint as core density scales beyond traditional ratios. According to the Cloudflare blog, each unit includes 768 GB of DDR5-6400 memory to prevent starvation during high-contention periods. Local storage speed further dictates whether rapid state retrieval maintains Service Level Agreements under load. As reported by the Cloudflare blog, 24 TB of PCIe 5.0 NVMe storage ensures persistent data access does not bottleneck the CPU pipeline.

| Component | Specification | Operational Impact |
| --- | --- | --- |
| Processor | 192 cores | Maximizes parallel thread execution |
| Memory | 768 GB DDR5 | Prevents bandwidth starvation |
| Storage | 24 TB NVMe | Eliminates I/O wait states |

The limitation is that raw core count offers no benefit if the software stack cannot schedule tasks without lock contention. Without a Rust-based runtime or equivalent non-blocking architecture, additional cores may increase context-switching overhead rather than useful work.
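
To make that contention point concrete, here is a minimal Rust sketch contrasting a single shared lock with per-thread sharding. The thread and iteration counts are illustrative assumptions; this is not a model of FL1 or FL2 internals, only of the general scheduling pattern the paragraph describes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let threads: usize = 8;
    let iters = 100_000;

    // Contended path: every thread serializes on one mutex, so extra cores mostly wait.
    let shared = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..iters {
                    *shared.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Sharded path: each thread owns its own counter; results merge once at the end.
    let shards: Arc<Vec<AtomicU64>> =
        Arc::new((0..threads).map(|_| AtomicU64::new(0)).collect());
    let handles: Vec<_> = (0..threads)
        .map(|i| {
            let shards = Arc::clone(&shards);
            thread::spawn(move || {
                for _ in 0..iters {
                    shards[i].fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    let sharded_total: u64 = shards.iter().map(|s| s.load(Ordering::Relaxed)).sum();
    println!(
        "locked total = {}, sharded total = {}",
        *shared.lock().unwrap(),
        sharded_total
    );
}
```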

Comparing L3 Cache Per Core: Turin High-Density vs Genoa-X

Cloudflare's Turin high-density OPNs provide 2MB of L3 cache per core, versus 12MB in Genoa-X architectures. This six-fold reduction forces a fundamental re-evaluation of how edge compute workloads access shared memory resources without incurring latency penalties. An L3 cache miss occurs when the processor cannot find requested data in the last-level cache, necessitating a slow retrieval from main DRAM that stalls execution threads.

Meanwhile, per the Cloudflare blog, the new architecture consumes up to 32% fewer watts per core than the previous generation despite higher density. The trade-off is that legacy software stacks relying on large cache pools suffer severe throughput degradation when hitting memory walls frequently. FL2 resolves this bottleneck by minimizing state retention in the CPU cache, allowing the system to tolerate higher miss rates without collapsing latency profiles. Operators deploying similar high-density configurations must prioritize memory-efficient code paths over raw clock speed to avoid performance cliffs.

L3 Cache Miss Penalties in NGINX and LuaJIT Runtimes

According to the Cloudflare blog, L3 cache hits complete in roughly 50 cycles, while misses requiring DRAM access take 350+ cycles. This disparity defines the mechanical failure of FL1 runtimes on high-density silicon. The NGINX and LuaJIT components within FL1 rely heavily on cache locality for instruction execution and state management. When the processor cannot locate data in the shared L3 cache, the core stalls while waiting for memory retrieval. This stall state propagates latency through the request handling pipeline. With 6x less cache per core, FL1 on Gen 13 hits main memory far more often, incurring latency penalties according to Cloudflare blog data. The AMD EPYC™ Turin architecture prioritizes thread count over individual core cache size. Consequently, workloads designed for 3D V-Cache environments face immediate contention. The penalty is not linear; it scales with utilization as available cache lines diminish.

| Metric | Low Utilization | High Utilization |
| --- | --- | --- |
| Cache availability | Sufficient for bursts | Exhausted by concurrency |
| Latency impact | Moderate increase | Severe degradation |
| Throughput gain | Marginal | Significant but costly |

The cost is predictable performance decay under load. Operators running legacy stacks on dense cores will observe throughput gains masked by tail latency spikes.
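
A back-of-the-envelope model makes the penalty tangible. The sketch below assumes the roughly 50-cycle hit and 350-cycle miss costs cited above and plugs in illustrative miss rates; these are not measured FL1 figures.

```rust
// Effective memory-access cost as a function of L3 miss rate.
// Assumption: ~50 cycles per L3 hit and ~350 cycles per miss, as cited above.
fn effective_cycles(miss_rate: f64) -> f64 {
    let hit_cycles = 50.0;
    let miss_cycles = 350.0;
    (1.0 - miss_rate) * hit_cycles + miss_rate * miss_cycles
}

fn main() {
    for miss_rate in [0.05, 0.15, 0.30] {
        println!(
            "L3 miss rate {:>4.1}% -> ~{:.0} cycles per last-level access",
            miss_rate * 100.0,
            effective_cycles(miss_rate)
        );
    }
}
```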

According to Cloudflare blog, AMD's Platform Quality of Service (PQOS) enables fine-grained regulation of shared cache and memory bandwidth. This mechanism isolates latency-sensitive workloads by assigning dedicated slices of the L3 cache to specific core groups. Turin processors consist of one I/O Die and up to 12 Core Complex Dies (CCDs), each sharing an L3 cache across up to 16 cores according to Cloudflare blog data. Operators can map critical threads to entire CCDs, preventing noisy neighbors from evicting hot data sets. However, dedicating whole CCDs reduces the total pool available for background batch processing tasks. The cost is a potential underutilization of silicon if the isolated workload does not saturate its assigned resources.

| Isolation Level | Cache Allocation | Outcome |
| --- | --- | --- |
| Single CCD share | Minimal isolation | Negligible gain |
| Socket level | Dedicated CCD | Acceptable stability |
| Full isolation | Exclusive access | Maximum predictability |

High core density creates contention that software tuning alone cannot resolve without hardware enforcement. The limitation is that PQOS requires OS-level support and careful thread affinity mapping to function correctly. Network engineers must align process scheduling with physical CCD boundaries to realize benefits. Failure to coordinate software threads with hardware topology renders the isolation ineffective. Precision in configuration dictates whether high-core servers deliver speed or chaos.
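
As one illustration of aligning threads with CCD boundaries, the sketch below pins the calling thread to a contiguous block of cores via `sched_setaffinity`. It assumes Linux and the `libc` crate; the 16-core range standing in for one CCD is an illustrative assumption, not Cloudflare's actual topology or tooling.

```rust
// Minimal sketch: pin a latency-sensitive thread to one CCD's cores.
// Assumes Linux and the `libc` crate (libc = "0.2" in Cargo.toml).
use std::mem;

fn pin_to_ccd(first_core: usize, core_count: usize) -> std::io::Result<()> {
    unsafe {
        let mut set: libc::cpu_set_t = mem::zeroed();
        libc::CPU_ZERO(&mut set);
        for cpu in first_core..first_core + core_count {
            libc::CPU_SET(cpu, &mut set);
        }
        // pid 0 means "the calling thread".
        if libc::sched_setaffinity(0, mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Keep this worker on cores 0-15 so its hot data stays within a single CCD's L3.
    pin_to_ccd(0, 16)
}
```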

Risks of Hardware Prefetcher Tuning and Worker Scaling Limits

As reported by the Cloudflare blog, hardware prefetcher and Data Fabric Probe Filter adjustments yielded only marginal gains for FL1 workloads. Tuning these low-level hardware mechanisms fails to compensate for the fundamental architectural mismatch between legacy code paths and reduced cache density. The latency penalty from DRAM access remains dominant regardless of aggressive prefetching strategies. Scaling FL1 workers across additional cores improved raw throughput but cannibalized resources from other production services, according to Cloudflare blog data. This approach creates a zero-sum game where edge compute capacity directly degrades auxiliary system stability. Operators prioritizing core count over cache must isolate critical services to prevent resource starvation across the host.

| Tuning Method | Outcome | Operational Impact |
| --- | --- | --- |
| Prefetcher/DF filters | Marginal gains | High effort, negligible return |
| Worker scaling | Throughput up | Service degradation elsewhere |
| PQOS allocation | Latency controlled | Requires complex configuration |

The limitation is that brute-forcing concurrency without software architecture changes merely shifts the bottleneck from CPU cycles to memory bandwidth contention. Pure core scaling cannot resolve latency spikes caused by cache misses.

Optimizing Workload Isolation and CPU Performance Through PQOS and Profiling

Implementation: AMD Platform Quality of Service (PQOS) Mechanism for Cache Regulation

Cloudflare blog data confirms PQOS regulates shared cache and memory bandwidth across 12 Core Complex Dies. This mechanism isolates latency-sensitive threads by assigning dedicated L3 cache slices to specific core groups, preventing eviction by noisy neighbors. Turin processors share a 384MB pool across the socket, making such regulation mandatory for high-density workloads. Operators must map critical paths to entire CCDs to maintain performance consistency under load.

  1. Identify the target workload requiring strict latency bounds.
  2. Allocate a full CCD exclusively to that workload using PQOS policies.
  3. Restrict remaining cores to background or batch processing tasks.

The limitation is that dedicating whole CCDs reduces the silicon available for throughput-bound background jobs. This trade-off forces a choice between absolute latency guarantees and maximum aggregate throughput. Without such isolation, cache contention degrades edge compute reliability regardless of core count.
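
For readers who want to experiment with this kind of allocation, the following is a minimal sketch using the Linux `resctrl` interface, which exposes L3 cache-allocation controls on supported AMD platforms. The group name, core range, and capacity mask are illustrative assumptions, not Cloudflare's production policy.

```rust
// Minimal sketch: carve out an L3 slice for a latency-sensitive group via resctrl.
// Assumptions: resctrl is mounted at /sys/fs/resctrl, the kernel supports L3
// allocation on this CPU, and the group name / mask below are hypothetical.
use std::fs;
use std::io;

fn isolate_latency_sensitive(cpus: &str, l3_mask: &str) -> io::Result<()> {
    let group = "/sys/fs/resctrl/fl_latency";          // hypothetical group name
    fs::create_dir_all(group)?;                         // create the resource group
    fs::write(format!("{group}/cpus_list"), cpus)?;     // pin the group to one CCD's cores
    // Grant this group an exclusive slice of L3 ways on cache domain 0.
    fs::write(format!("{group}/schemata"), format!("L3:0={l3_mask}\n"))?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Example: cores 0-15 (one CCD) receive the upper half of the L3 way mask.
    isolate_latency_sensitive("0-15", "ff00")
}
```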

Measuring L3 Cache Misses with AMD uProf on FL1 Workloads

In practice, per the Cloudflare Engineering Team, AMD uProf collected the counters that revealed L3 cache miss spikes during the Gen 13 evaluation. Operators must install the tool and attach it to running NGINX processes to capture cycle-accurate metrics.

  1. Execute the profiler against the target binary while simulating production traffic loads.
  2. Filter results to display only memory-related events such as `MEM_INST_RETIRED` and `LLC_MISS`.
  3. Correlate high miss counts with the specific LuaJIT execution threads causing pipeline stalls.

The analysis confirms that legacy code paths trigger frequent DRAM fetches when data is evicted from the reduced cache pool. This behavior dominates request processing time, creating a bottleneck that raw core count cannot solve. However, relying solely on these metrics ignores the thermal throttling risks associated with sustained high-frequency memory access. Continuous polling of deep memory states increases power draw per transaction, potentially negating efficiency gains from the newer silicon architecture. Network engineers should treat high miss rates not just as latency indicators but as signals of imminent thermal constraints on dense racks.
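
For operators without AMD uProf on hand, a rough equivalent of the collection step can be sketched with Linux `perf stat`. The generic cache-references/cache-misses events and the worker PID below are assumptions for illustration, not the exact counters or workflow Cloudflare used.

```rust
// Minimal sketch using Linux `perf stat` as a stand-in for AMD uProf.
use std::process::Command;

fn sample_llc_counters(pid: u32, seconds: u32) -> std::io::Result<String> {
    let pid_arg = pid.to_string();
    let window = seconds.to_string();
    let output = Command::new("perf")
        .args([
            "stat",
            "-e", "cache-references,cache-misses", // generic last-level-cache counters
            "-p", pid_arg.as_str(),                // attach to a running worker process
            "--", "sleep", window.as_str(),        // sampling window in seconds
        ])
        .output()?;
    // perf prints its counter summary on stderr.
    Ok(String::from_utf8_lossy(&output.stderr).into_owned())
}

fn main() -> std::io::Result<()> {
    // 12345 is a hypothetical NGINX worker PID; sample for 30 seconds.
    println!("{}", sample_llc_counters(12345, 30)?);
    Ok(())
}
```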

Latency Penalties from DRAM Access in High-Density Core Configurations

According to the Cloudflare Engineering Team, memory fetch latency dominated request processing as L3 misses forced trips to DRAM. Legacy FL1 code paths, optimized for larger caches, stall pipelines when data is evicted to main memory. This bottleneck manifests sharply under load; according to the Cloudflare Engineering Team, latency penalties scaled with utilization as CPU usage increased and cache contention worsened. Operators optimizing software for low-cache CPUs must recognize that raw throughput gains vanish if the application layer cannot keep hot data resident. Simply adding cores widens the contention window without architectural changes to the request handler. The takeaway: high core density demands memory-efficient software architectures, such as Rust-based rewrites, to avoid saturating the memory bus. Ignoring this distinction leads to deployments where increased parallelism directly degrades tail latency performance.

Measurable ROI and Throughput Gains from the FL2 Software Rewrite

Quantifying FL2 Throughput Gains on AMD Turin 9965

Figure: FL2 delivers a 100% throughput gain over Gen 12 versus 62% for legacy FL1, alongside a 70% latency reduction and a chart of the hybrid cloud market growing from $860B to $2.65T.

Data from the Gen 13 deployment confirms FL2 delivers 100% higher throughput than Gen 12 servers. Legacy FL1 configurations achieved only a 62% increase over the previous generation, leaving substantial capacity unused due to cache contention. The Rust-based rewrite fundamentally alters memory access patterns, allowing the AMD Turin 9965 to exploit its full core density without stalling on DRAM fetches. This architectural shift converts raw core count into linear performance scaling rather than diminishing returns. Latency behavior dictates operational viability at scale. The software update slashed the latency penalty by 70%. Such a reduction permits operators to push CPU utilization higher while strictly adhering to service level agreements. High-density deployments often fail when memory bottlenecks inflate tail latencies, forcing a choice between efficiency and responsiveness. FL2 resolves this tension by minimizing the footprint per request. Operators relying on legacy stacks cannot replicate these gains through hardware tuning alone. Pure core scaling fails without matching memory efficiency in the application logic.

Applying Linear Core Scaling to Edge Network Density

Throughput scales linearly with core count because FL2 eliminates cache bottlenecks. Legacy architectures stall when L3 cache density drops, but the Rust-based rewrite removes the dependency on massive local caches. This mechanism allows the AMD Turin 9965 to use its full 192-core complement without incurring the DRAM latency penalties typical of cache misses. Operators can now maximize rack density while maintaining strict latency bounds previously impossible under high utilization. Rack throughput rises 60% versus Gen 12 for global edge upgrades. This gain permits smooth deployment across the network while keeping the rack power budget constant. The implication is substantial capacity expansion without proportional energy cost increases. Stakeholders achieve superior edge compute performance by pairing high-density silicon with memory-efficient software layers. Without software optimization, added cores merely increase contention windows rather than processing capacity. Gen 13 servers are ready to serve millions of requests, supporting a company that grew revenue by 29.85% year-over-year to reach $2.168 billion in 2025. Pure hardware scaling fails if the request handler cannot feed cores efficiently.

FL2 Efficiency Gains Versus Legacy FL1 Cache Contention

Performance per watt improves 50% versus Gen 12, proving that leaner memory access patterns beat massive L3 caches. The legacy FL1 layer relied on large cache pools to hide DRAM latency, a design that collapsed when core density doubled without proportional cache growth. FL2 removes this dependency by restructuring how request data flows through memory, allowing linear scaling regardless of cache-to-core ratios. Most operators using standard NGINX configurations will not see these gains because their software stacks still chase cache hits rather than optimizing for bandwidth. Migrating to a Rust-based handler demands rigorous validation of memory safety guarantees before production rollout. Raw core count means nothing if the application layer cannot feed instructions fast enough. InterLIR recommends auditing current L3 cache miss rates before purchasing high-density servers, as legacy code will simply throttle throughput under load. The window for hardware-only upgrades has closed; software architecture now dictates whether additional cores provide value or just consume power. Operators ignoring this shift will face diminishing returns despite aggressive hardware refresh cycles.

About

Alexei Krylov, Head of Sales at InterLIR, brings a unique perspective to the evolution of server infrastructure like Gen 13. While his daily work focuses on optimizing IPv4 resource allocation and ensuring network availability, he understands that modern hardware efficiency directly impacts IP utilization strategies. The transition to Gen 13 servers, which prioritizes throughput over massive cache sizes via software rewrites like FL2, mirrors InterLIR's mission to maximize existing digital assets through transparency and efficiency. As organizations deploy these high-performance servers to handle increased traffic loads, the demand for clean, reliable IP addresses grows concurrently. Krylov's expertise in B2B sales and RIR interactions allows him to connect these infrastructure upgrades with the critical need for scalable network resources. At InterLIR, facilitating access to these essential components ensures that clients using next-gen hardware can fully realize their network potential without resource bottlenecks.

Conclusion

The era of solving latency with massive silicon real estate is over; cache starvation at scale exposes lazy software architectures immediately. When L3 allocation drops sixfold, the bottleneck shifts decisively from memory bandwidth to application-level data locality. Operators relying on legacy caching strategies will see their new hardware regress into power-hungry paperweights, as contention windows expand faster than core counts can compensate. The hidden operational cost here is not electricity, but the engineering debt required to refactor monolithic handlers for bandwidth-centric flows.

Commit to a software-first migration strategy before Q3 hardware refreshes. Do not deploy high-density Turin nodes unless your request path eliminates unnecessary memory lookups. The window for pure hardware upgrades has closed; future performance gains belong exclusively to stacks optimized for streaming data rather than cache hits. If your current architecture cannot sustain throughput with minimal L3 reliance, adding cores will only accelerate failure modes.

Start by auditing L3 miss rates on your production edge fleet this week using eBPF tools, specifically targeting functions with high memory churn. Establish a baseline threshold where miss rates exceed 15%, and flag any service crossing this line as incompatible with next-gen silicon until refactored. This diagnostic step prevents costly deployment failures and forces the necessary conversation about code efficiency before capital expenditure begins.
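
Here is a trivial sketch of that flagging rule, assuming the miss and reference counts come from whatever profiler you already run; the 15% cutoff is the threshold proposed above and the sample counters are hypothetical.

```rust
// Flag a service when its LLC miss rate exceeds the 15% threshold proposed above.
fn needs_refactor(llc_misses: u64, llc_references: u64) -> bool {
    llc_references > 0 && (llc_misses as f64 / llc_references as f64) > 0.15
}

fn main() {
    // Hypothetical counters sampled from one edge service.
    let (misses, refs) = (9_000_000u64, 50_000_000u64);
    println!(
        "miss rate {:.1}% -> flag for refactor before Gen 13 rollout: {}",
        100.0 * misses as f64 / refs as f64,
        needs_refactor(misses, refs)
    );
}
```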

Frequently Asked Questions

Why did Cloudflare reduce L3 cache per core in Gen 13 servers?
Cloudflare prioritized core density over cache size to maximize parallel throughput. Turin processors allocate just 2MB L3 cache per core, enabling up to 192 cores compared to previous 96-core designs for greater edge scaling.
How much memory and storage prevent starvation in high-density Gen 13 units?
Each Gen 13 unit includes massive resources to feed the dense CPU cores. Cloudflare blog data shows each unit includes 768 GB DDR5 memory and 24 TB PCIe 5.0 NVMe storage to prevent starvation.
What latency penalty occurs when L3 cache misses force DRAM access?
Missing the L3 cache forces slow memory fetches that stall CPU cycles significantly. While hits take roughly 50 cycles, misses requiring DRAM access take 350+ cycles, roughly a sevenfold difference.
How much does power efficiency improve with AMD EPYC Turin versus Genoa-X?
The new architecture delivers better performance while consuming significantly less energy per core. Despite higher core counts, Turin consumes up to 32% fewer watts per core compared to prior Genoa-X architectures.
What throughput gains were achieved after rewriting the stack for Gen 13?
The Rust-based FL2 rewrite unlocked performance that legacy stacks could not achieve on Turin. Legacy FL1 configurations gained 62% over baseline Gen 12 metrics, while FL2 delivered roughly 100% higher throughput and cut the latency penalty by 70%.
Alexei Krylov
Head of Sales