DeepSeek Unveils DualPath: Novel Framework Leverages Idle Network Cards to Accelerate Agent Inference, Breaks PD-Disaggregation Bottleneck

DeepSeek has quietly released a research paper outlining a novel inference framework, potentially hinting at infrastructure advancements for its anticipated V4 model. The framework, named DualPath, addresses a critical performance bottleneck in large language model agents.

While the AI community eagerly awaits DeepSeek's next major model release, the company's research team, alongside collaborators from Peking University and Tsinghua University, has published a paper on arXiv detailing DualPath. This new framework is specifically engineered for agent systems performing long-context, multi-turn reasoning.

The Core Challenge: The I/O Wall

In contemporary agent applications, conversations are long and span many turns. Because each turn reuses almost all of the preceding context, the KV-Cache hit rate is very high (often over 95%), so for each new interaction the system must reload large amounts of cached state, its "historical memory," from external storage. The performance bottleneck has thus shifted from pure computation to the data movement (I/O) required to fetch this cache.
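
To give a feel for the scale, the following Python sketch estimates how much cached state one turn has to pull from storage. Only the 95% hit rate comes from the text above; the context length and the per-token KV-Cache size are hypothetical placeholders, not figures from the paper.

```python
# Back-of-envelope estimate of KV-Cache traffic for one agent turn.
# Every constant except the hit rate is a hypothetical value for illustration.

def cached_tokens(context_tokens: int, hit_rate: float) -> int:
    """Tokens whose KV state is already in storage and must be reloaded."""
    return round(context_tokens * hit_rate)

def tokens_to_compute(context_tokens: int, hit_rate: float) -> int:
    """Tokens that miss the cache and still need prefill computation."""
    return context_tokens - cached_tokens(context_tokens, hit_rate)

def kv_bytes_to_load(context_tokens: int, hit_rate: float,
                     kv_bytes_per_token: int) -> int:
    """Bytes fetched from external storage for one turn."""
    return cached_tokens(context_tokens, hit_rate) * kv_bytes_per_token

CONTEXT_TOKENS = 100_000         # hypothetical context length
HIT_RATE = 0.95                  # "often over 95%" per the article
KV_BYTES_PER_TOKEN = 64 * 1024   # hypothetical 64 KiB per token

loaded = kv_bytes_to_load(CONTEXT_TOKENS, HIT_RATE, KV_BYTES_PER_TOKEN)
computed = tokens_to_compute(CONTEXT_TOKENS, HIT_RATE)
print(f"reload {loaded / 2**30:.2f} GiB, recompute only {computed} tokens")
```

Under these assumed numbers, the turn moves several gibibytes from storage while computing only a small fraction of the context, which is why I/O rather than compute dominates.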

The traditional Prefill-Decode disaggregated (PD-disaggregated) architecture exacerbates this issue. All KV-Cache loading goes through the storage network interface cards (SNICs) of the Prefill Engines (PEs), saturating their bandwidth. Meanwhile, the SNICs on the Decode Engines (DEs) remain largely idle, creating a severe resource mismatch.
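
The mismatch can be put in numbers with a small Python sketch. It assumes, purely for illustration, that every engine has one SNIC of equal bandwidth; the engine counts are hypothetical and not taken from the paper.

```python
# Share of the cluster's storage bandwidth that is usable when only the
# Prefill Engines' SNICs load KV-Cache.
# Assumption: one SNIC of equal bandwidth per engine (hypothetical setup).

def usable_storage_fraction(num_prefill: int, num_decode: int) -> float:
    """Fraction of all SNICs that actually carry KV-Cache loads."""
    total = num_prefill + num_decode
    if total == 0:
        raise ValueError("cluster has no engines")
    return num_prefill / total

NUM_PREFILL = 2   # hypothetical number of Prefill Engines
NUM_DECODE = 6    # hypothetical number of Decode Engines

fraction = usable_storage_fraction(NUM_PREFILL, NUM_DECODE)
print(f"storage bandwidth in use for cache loading: {fraction:.0%}")
```

The more decode-heavy the cluster, the smaller this fraction becomes, so the idle share of storage bandwidth grows with the number of Decode Engines.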

DualPath's Innovative Solution

DualPath's fundamental insight is that KV-Cache loading does not have to be centered on the Prefill Engine. It breaks the traditional single-path "Storage-to-Prefill" model by introducing a second path: "Storage-to-Decode."

Path A (Traditional): Storage → Prefill Engine (PE).
Path B (New): Storage → Decode Engine (DE) → Prefill Engine (PE).

In Path B, the KV-Cache is first loaded into a buffer on an idle Decode Engine. It is then transferred to the Prefill Engine via a high-bandwidth compute network (using RDMA). A central scheduler dynamically decides which path each request should take, enabling global pooling of storage bandwidth and dynamic load balancing across the entire cluster.
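
The two paths and the per-request routing decision can be sketched in Python as follows. This is a minimal illustration, not the paper's algorithm: the greedy "least-queued SNIC" rule, the `Snic` class, and the `choose_path` function are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Snic:
    """Hypothetical view of one engine's storage NIC."""
    owner: str              # engine name, e.g. "pe0" or "de1"
    queued_bytes: int = 0   # bytes already scheduled on this NIC

def choose_path(pe: Snic, decode_snics: List[Snic], request_bytes: int) -> str:
    """Route one KV-Cache load over Path A or Path B.

    Path A: Storage -> PE. Path B: Storage -> DE -> PE.
    Greedy rule (an assumption): use whichever SNIC has the least queued work,
    preferring the direct path on ties.
    """
    best_de = min(decode_snics, key=lambda s: s.queued_bytes, default=None)
    if best_de is None or pe.queued_bytes <= best_de.queued_bytes:
        pe.queued_bytes += request_bytes
        return "A"
    best_de.queued_bytes += request_bytes
    return f"B via {best_de.owner}"

pe = Snic("pe0")
decoders = [Snic("de0"), Snic("de1")]
routes = [choose_path(pe, decoders, request_bytes=100) for _ in range(3)]
print(routes)
```

In this toy run the first request takes the direct path and the next two spill over to the idle Decode Engines, which is the pooling effect the article describes. A real scheduler would also have to account for the extra DE-to-PE hop over the compute network.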

Architectural Components & Optimizations

The framework consists of three main components:

Inference Engines: GPUs strictly designated as Prefill (PE) or Decode (DE) engines.
Traffic Manager: Handles host-to-device/device-to-host copies, inter-engine transfers, and SNIC storage I/O.
Central Scheduler: Acts as the "brain," making real-time routing decisions to maximize global bandwidth utilization.

To prevent the new data path from interfering with critical model computation traffic, DualPath implements two key optimizations:

Compute-Network-Centric Traffic Management: All traffic is forced through paired Compute Network Interface Cards (CNICs) using GPUDirect RDMA. Techniques like Virtual Lanes (VL) in InfiniBand/RoCE networks are used to prioritize inference communication, reserving 99% of the bandwidth for it and allowing cache transfer traffic to use only the remaining gaps.
Adaptive Request Scheduler: The scheduler monitors disk queue lengths and token counts on each node, preferentially assigning tasks to nodes with lower I/O pressure and lighter computational loads to avoid congestion.
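
The adaptive scheduling idea can be illustrated with a short Python sketch. The `NodeStats` fields mirror the two signals named above (disk queue length and token count); the normalized weighted score and the equal weights are assumptions for this example, not the scoring function used in the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NodeStats:
    """Hypothetical per-node snapshot seen by the scheduler."""
    name: str
    disk_queue_len: int   # outstanding storage reads (I/O pressure)
    pending_tokens: int   # tokens waiting to be processed (compute load)

def pick_node(nodes: List[NodeStats],
              io_weight: float = 0.5,
              compute_weight: float = 0.5) -> NodeStats:
    """Return the node with the lowest combined I/O and compute pressure.

    Each signal is normalized by the cluster-wide maximum so the two are
    comparable. Raises ValueError if `nodes` is empty.
    """
    max_queue = max(n.disk_queue_len for n in nodes) or 1
    max_tokens = max(n.pending_tokens for n in nodes) or 1

    def score(n: NodeStats) -> float:
        return (io_weight * n.disk_queue_len / max_queue
                + compute_weight * n.pending_tokens / max_tokens)

    return min(nodes, key=score)

cluster = [
    NodeStats("n0", disk_queue_len=8, pending_tokens=4000),
    NodeStats("n1", disk_queue_len=2, pending_tokens=6000),
    NodeStats("n2", disk_queue_len=4, pending_tokens=1000),
]
chosen = pick_node(cluster)
print(chosen.name)
```

Here `n1` has the shortest disk queue but the heaviest compute backlog, so the combined score favors `n2`. Looking at both signals together is what keeps the scheduler from relieving I/O congestion only to create a compute hotspot.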

Performance Gains

Tested on production-scale 660B-parameter models (including DeepSeek-V3 and Qwen), DualPath demonstrated substantial improvements:

Offline Inference Throughput: Increased by up to 1.87x.
Online Service Throughput: Improved by an average of 1.96x.
Latency: Significantly reduced Time-To-First-Token (TTFT) under high load, while inter-token latency (time between tokens, TBT, or time per output token, TPOT) remained stable and nearly unaffected.

Significance and Implications

DualPath shows that rethinking data loading pathways can effectively break the current "I/O wall" in large model inference. By harnessing the otherwise idle storage I/O bandwidth on decode engines and combining it with intelligent scheduling and strict traffic isolation, the framework delivers major efficiency gains for agent-based LLM systems without requiring additional hardware investment.

The first author of the paper is Yongtong Wu, a Ph.D. candidate at Peking University advised by Professor Xin Jin. Wu is currently interning with DeepSeek's systems team, working on next-generation inference infrastructure.

Reference: arXiv:2602.21548