(* Equal Contribution, † Corresponding Author)
* WeDLM achieves significantly lower latency by predicting multiple future tokens in parallel.
Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Large Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency.
We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods.
Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3× on challenging reasoning benchmarks and up to 10× in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
We propose WeDLM, a DLLM framework that performs mask recovery entirely under causal attention via Topological Reordering. This design enables seamless initialization from pre-trained AR checkpoints and inherent prefix-cache compatibility.
Predicted tokens can be cached as soon as they are committed, without waiting for subsequent positions: the strict causal mask ensures that their KV states depend only on already-committed context and can therefore be reused directly.
We introduce a decoding strategy with a distance penalty that promotes left-to-right resolution, together with a dynamic sliding window that continuously refills new masks as finalized tokens are committed, eliminating the stop-and-wait bottleneck of block-wise methods.
We demonstrate that WeDLM surpasses optimized vLLM baselines in wall-clock speed, achieving more than 3× speedups on complex reasoning tasks while maintaining generation quality.
Speed vs. Accuracy. WeDLM-8B (star marker) achieves a ~3× speedup over the vLLM-optimized Qwen3-8B while maintaining high accuracy, significantly outperforming prior diffusion models.
Holistic Capability. WeDLM matches or surpasses the strong capabilities of the Qwen3-8B-Instruct baseline across math, coding, and general knowledge benchmarks.
WeDLM introduces two key innovations to bridge the architectural gap between Diffusion Language Models and standard Causal Attention mechanisms.
Overview of the WeDLM training framework. Left: Topological Reordering physically shifts observed tokens to the prefix while preserving logical positions. Right: Dual-Stream Masking concatenates a clean Memory Stream with a masked Prediction Stream.
Standard Causal Attention uses a lower-triangular mask, so a token can never attend to positions that come after it, yet mask recovery needs each masked position to condition on observed tokens that may sit later in the sequence. WeDLM resolves this tension by decoupling physical memory order from logical position IDs: observed tokens are moved to the physical prefix, while their original position IDs preserve the logical order of the sequence.
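The following is a minimal sketch of Topological Reordering under stated assumptions: `topological_reorder` and `MASK_ID` are illustrative names rather than the official WeDLM API, and the outputs are meant to feed a standard causal-attention Transformer that applies rotary embeddings using the supplied `position_ids`.

```python
# Minimal sketch of Topological Reordering, assuming a model whose attention is
# strictly causal over the *physical* token order and whose RoPE uses the supplied
# position_ids. Names (topological_reorder, MASK_ID) are illustrative, not the
# official WeDLM implementation.
import torch

MASK_ID = 0  # placeholder id for the [MASK] token; the real vocabulary id differs

def topological_reorder(input_ids: torch.Tensor, observed: torch.Tensor):
    """input_ids: (seq_len,) token ids with MASK_ID at unresolved positions.
    observed:  (seq_len,) bool, True where the token is already known."""
    logical_pos = torch.arange(input_ids.numel())
    # Physical order: all observed tokens first (kept in logical order), then masks.
    perm = torch.cat([logical_pos[observed], logical_pos[~observed]])
    reordered_ids = input_ids[perm]
    position_ids = logical_pos[perm]  # logical positions travel with their tokens
    return reordered_ids, position_ids

# Example: logical positions 0, 1, 3, 5 are observed; 2, 4, 6 are still masked.
ids = torch.tensor([11, 12, MASK_ID, 14, MASK_ID, 16, MASK_ID])
obs = torch.tensor([True, True, False, True, False, True, False])
reordered_ids, position_ids = topological_reorder(ids, obs)
# reordered_ids -> [11, 12, 14, 16, MASK, MASK, MASK]
# position_ids  -> [ 0,  1,  3,  5,    2,    4,    6]
# Under a lower-triangular mask over this physical order, every masked position
# attends to all observed tokens, while the observed prefix never attends to masks,
# so its KV states are exactly the ones a normal causal prefix cache would store.
```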
Block Decoding vs. WeDLM Streaming Parallel Decoding. Block decoding suffers from stop-and-wait. In contrast, WeDLM uses standard causal attention with a dynamic sliding window: resolved tokens are immediately cache-ready and committed.
Traditional block-wise decoding suffers from "stop-and-wait" overhead: every token in the current block must be resolved before the next block can begin, so a few slow positions leave the remaining parallel slots idle. WeDLM introduces a Dynamic Sliding Window that refills mask slots as soon as tokens are committed, keeping the parallel workload full and eliminating these pipeline bubbles. A rough sketch of the resulting decoding loop is shown below.
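The sketch below is a hedged illustration of the streaming loop, not the paper's exact algorithm. The `model` callable, `threshold`, `alpha`, and `mask_id` are assumed stand-ins, and the acceptance rule is a simplified rendering of the distance-penalized confidence test.

```python
# Hedged sketch of streaming parallel decoding with a dynamic sliding window.
# Assumption: model(tokens) returns a (window, vocab) logits tensor for the
# trailing mask slots; all hyperparameters here are illustrative.
import torch

def stream_decode(model, prompt_ids, window=32, max_new=256,
                  threshold=0.9, alpha=0.02, mask_id=0):
    committed = list(prompt_ids)  # growing left-to-right prefix (KV-cache ready)

    while len(committed) - len(prompt_ids) < max_new:
        # The window of masks always starts right after the committed prefix,
        # so the parallel workload stays fixed (no stop-and-wait on a block).
        logits = model(committed + [mask_id] * window)   # (window, vocab)
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)

        # Distance penalty: positions farther from the prefix need higher
        # confidence to be accepted, which biases resolution left-to-right.
        accept = conf > (threshold + alpha * torch.arange(window))

        # Commit the leftmost run of accepted tokens, at least one per step.
        n_commit = 0
        while n_commit < window and bool(accept[n_commit]):
            committed.append(int(pred[n_commit]))
            n_commit += 1
        if n_commit == 0:
            committed.append(int(pred[0]))

    return committed[len(prompt_ids):]
```

In a real deployment the committed prefix would be served from the KV cache, so each step's cost is dominated by the fixed-size window of masks rather than the growing prefix.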
WeDLM achieves state-of-the-art performance, consistently outperforming optimized AR engines (vLLM) and prior diffusion models across mathematical reasoning and code generation tasks.
As analyzed in the paper, WeDLM achieves peak throughput on low-entropy tasks. In this counting task (1 to 200), the model resolves 8.10 tokens per forward pass on average because the pattern is highly predictable.
📚 Want to learn more? Read our Technical Report for full details on WeDLM methodology! 🚀