WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu1,*,† Minghua He1,2,* Shaoxun Zeng3 Sijun Zhang1 Linhao Zhang1 Chuhan Wu1 Wei Jia1 Yuan Liu1 Xiao Zhou1 Jie Zhou1
1 WeChat AI, Tencent 2 Peking University 3 Tsinghua University

(* Equal Contribution, † Corresponding Author)

[Interactive speed demo: the same prompt ("Calculate the sum of all integers from 1 to 100.") answered by Qwen3-8B under traditional sequential (autoregressive) decoding served by vLLM versus WeDLM streaming parallel decoding, with live latency counters showing a 3~10x speedup for WeDLM.]

* WeDLM achieves significantly lower latency by predicting future tokens in parallel blocks.

Abstract

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency.

We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods.

Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3× on challenging reasoning benchmarks and up to 10× in low-entropy generation regimes. Critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.

Key Contributions

Causal Diffusion

We propose WeDLM, a DLLM framework that performs mask recovery entirely under causal attention via Topological Reordering. This design enables seamless initialization from pre-trained AR checkpoints and inherent prefix-cache compatibility.

Prefix-Cache Compatibility

Predicted tokens can be cached as soon as they are committed, without waiting for subsequent positions: the strict causal mask ensures their KV states depend only on the committed context, so they can be reused right away. A toy check of this property appears below.
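The snippet below is a toy, single-layer check of this property (illustrative only, not WeDLM code): under a strict lower-triangular mask, appending new tokens never changes the attention outputs of earlier positions, which is why the keys and values of committed tokens can be cached and reused immediately.

```python
# Toy check (not WeDLM code): with a strict causal mask, appending tokens never
# changes the attention outputs of earlier positions, so committed tokens' K/V
# states can be cached immediately and reused unchanged.
import torch

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()  # hide future positions
    return scores.masked_fill(mask, float("-inf")).softmax(-1) @ v

torch.manual_seed(0)
d = 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
prefix = torch.randn(5, d)   # 5 committed tokens
suffix = torch.randn(3, d)   # 3 newly predicted tokens appended later

out_prefix_only = causal_attention(prefix, Wq, Wk, Wv)
out_full = causal_attention(torch.cat([prefix, suffix]), Wq, Wk, Wv)
assert torch.allclose(out_prefix_only, out_full[:5], atol=1e-6)  # prefix outputs unchanged
```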

Streaming Parallel Decoding

A decoding strategy with a distance penalty that promotes left-to-right resolution, plus a dynamic sliding window that continuously refills new masks as finalized tokens are committed, eliminating the stop-and-wait bottleneck of block-wise methods.

First DLLM to Outperform Industrial AR Engines

We demonstrate that WeDLM surpasses optimized vLLM baselines in wall-clock speed, achieving around 3× speedups on complex reasoning tasks while maintaining generation quality.

Performance Overview

Performance-Speed Trade-off

Speed vs. Accuracy. WeDLM-8B (Star) achieves ~3x speedup over vLLM-optimized Qwen3-8B while maintaining high accuracy, significantly outperforming prior diffusion models.

Holistic Capability

Holistic Capability. WeDLM matches or surpasses the strong capabilities of the Qwen3-8B-Instruct baseline across math, coding, and general knowledge benchmarks.

Methodology

WeDLM introduces two key innovations to bridge the architectural gap between Diffusion Language Models and standard Causal Attention mechanisms.

01 · WeDLM Training: Causal Mask Recovery

[Figure: WeDLM Training Framework] Overview of the WeDLM training framework. Left: Topological Reordering physically shifts observed tokens to the prefix while preserving logical positions. Right: Dual-Stream Masking concatenates a clean Memory Stream with a masked Prediction Stream.

Topological Reordering

Standard Causal Attention uses a lower-triangular mask that hides future tokens from every query. WeDLM resolves this by decoupling the physical memory order from the logical position IDs; a minimal code sketch appears after the visualization below.

  • Separates logical semantics (RoPE) from physical computation.
  • Moves 'Observed' tokens to the physical start of the sequence.
  • Enables 'Noisy' tokens to attend to known futures via standard masks.
[Interactive visualization: the logical view versus the physical layout after reordering, highlighting KV-cached tokens and the active query.]
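A minimal sketch of the reordering itself, assuming a simple tensor interface (an illustration, not the official WeDLM implementation):

```python
# Minimal sketch of Topological Reordering (illustrative, not the official WeDLM code).
# Observed tokens are moved to the physical prefix while their logical position IDs
# (used for RoPE) are preserved, so a plain lower-triangular causal mask lets every
# masked position attend to all currently observed tokens.
import torch

def topological_reorder(token_ids: torch.Tensor, observed: torch.Tensor):
    """token_ids: (seq_len,) token ids; observed: (seq_len,) bool mask of known tokens.
    Returns the physically reordered tokens, their logical positions (fed to RoPE),
    and the permutation used, so outputs can be scattered back into logical order."""
    logical_pos = torch.arange(token_ids.numel())
    perm = torch.cat([logical_pos[observed], logical_pos[~observed]])  # observed first
    return token_ids[perm], logical_pos[perm], perm

# Example: positions 0, 1, 4 are observed; positions 2 and 3 are still masked.
tokens = torch.tensor([11, 12, 0, 0, 15])                 # 0 stands in for [MASK]
obs = torch.tensor([True, True, False, False, True])
phys_tokens, rope_pos, perm = topological_reorder(tokens, obs)
# phys_tokens -> [11, 12, 15, 0, 0]; rope_pos -> [0, 1, 4, 2, 3]
# Under a standard causal mask over this physical order, the masked queries
# (physical slots 3 and 4) can attend to the observed token at logical position 4,
# which a vanilla causal layout would have hidden from them.
```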
02 · WeDLM Inference: Streaming Parallel Decoding

[Figure: WeDLM Streaming Parallel Decoding] Block decoding vs. WeDLM streaming parallel decoding. Block decoding suffers from stop-and-wait; in contrast, WeDLM uses standard causal attention with a dynamic sliding window, so resolved tokens are immediately cache-ready and committed.

[Interactive animation: the streaming decoding procedure step by step (Step 1: Diffuse) over the active window.]

Dynamic Sliding Window

Traditional block-wise decoding suffers from "stop-and-wait" overhead. WeDLM instead introduces a Dynamic Sliding Window that eliminates pipeline bubbles; a simplified sketch of the decoding loop follows the list below.

  • Asynchronous Commit: Finalize tokens as soon as confident.
  • Zero Bubbles: Immediately slide window to ingest new tokens.
  • Maintains high GPU utilization, unlike block-wise methods.
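The sketch below shows a simplified version of this loop. It is written under stated assumptions: `model` is a hypothetical callable that scores `num_masked` mask slots appended after the committed prefix, and the confidence threshold and linear distance penalty are illustrative choices rather than the paper's exact schedule.

```python
# Simplified streaming parallel decoding with a dynamic sliding window
# (illustrative; `model`, conf_threshold, and the linear distance penalty are
# assumptions, not the official WeDLM implementation).
import torch

def streaming_decode(model, prompt_ids, window=16, max_new=256,
                     conf_threshold=0.9, distance_penalty=0.05):
    committed = list(prompt_ids)                 # growing left-to-right prefix (cache-friendly)
    while len(committed) - len(prompt_ids) < max_new:
        # Score `window` mask slots appended after the committed prefix.
        logits = model(torch.tensor([committed]), num_masked=window)  # (1, window, vocab)
        conf, pred = logits.softmax(-1).max(-1)                       # (1, window)
        # Distance penalty: positions farther from the prefix need higher confidence,
        # which biases resolution toward left-to-right order.
        penalized = conf[0] - distance_penalty * torch.arange(window)
        # Commit the longest confident run starting at the window's left edge;
        # committed tokens join the prefix, so their KV entries are immediately reusable.
        n_commit = 0
        while n_commit < window and penalized[n_commit] >= conf_threshold:
            n_commit += 1
        n_commit = max(n_commit, 1)              # always make progress
        committed.extend(pred[0, :n_commit].tolist())
        # The window then slides right and refills with fresh masks, so the parallel
        # workload stays fixed and there is no stop-and-wait bubble.
    return committed
```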

Experimental Results

WeDLM achieves state-of-the-art results among diffusion language models: it outperforms optimized AR engines (vLLM) in wall-clock speed while matching their generation quality, and it surpasses prior diffusion models across mathematical reasoning and code generation tasks.

Case Study: Low-Entropy Generation

The "Counting" Task

As analyzed in the paper, WeDLM achieves peak throughput on low-entropy tasks. In this counting task (1 to 200), the model resolves 8.10 tokens per forward pass on average because the pattern is highly predictable.

User: Count from 1 to 200
Assistant: Sure! Here are the numbers from 1 to 200:
1, 2, 3, 4, 5, ... [WeDLM generates strictly correct sequence] ... 199, 200.
Throughput: 1673.3 tokens/sec · Average entropy: 0.038 · 14.17 tokens per forward pass

BibTeX (Coming Soon)

📚 Want to learn more? Read our Technical Report for full details on WeDLM methodology! 🚀