WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu1,*,† Minghua He1,2,* Shaoxun Zeng3 Sijun Zhang1 Linhao Zhang1 Chuhan Wu1 Wei Jia1 Yuan Liu1 Xiao Zhou1 Jie Zhou1
1 WeChat AI, Tencent 2 Peking University 3 Tsinghua University

(* Equal Contribution, † Corresponding Author)

[Interactive speed demo: the same prompt ("Calculate the sum of all integers from 1 to 100.") answered by Qwen3-8B under traditional sequential (autoregressive) decoding served by vLLM versus WeDLM streaming parallel decoding, with live latency counters showing a 3~10x speedup for WeDLM.]

* WeDLM achieves significantly lower latency by predicting future tokens in parallel blocks.

Abstract

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency.

We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods.

Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3× on challenging reasoning benchmarks and up to 10× in low-entropy generation regimes. Critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.

Key Contributions

Causal Diffusion

We propose WeDLM, a DLLM framework that performs mask recovery entirely under causal attention via Topological Reordering. This design enables seamless initialization from pre-trained AR checkpoints and inherent prefix-cache compatibility.

Prefix-Cache Compatibility

Predicted tokens can be cached as soon as they are committed, without waiting for subsequent positions: the strict causal mask ensures their KV states depend only on the committed context, so they can be reused right away. A toy check of this property appears below.
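The snippet below is a toy, single-layer check of this property (illustrative only, not WeDLM code): under a strict lower-triangular mask, appending new tokens never changes the attention outputs of earlier positions, which is why the keys and values of committed tokens can be cached and reused immediately.

```python
# Toy check (not WeDLM code): with a strict causal mask, appending tokens never
# changes the attention outputs of earlier positions, so committed tokens' K/V
# states can be cached immediately and reused unchanged.
import torch

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()  # hide future positions
    return scores.masked_fill(mask, float("-inf")).softmax(-1) @ v

torch.manual_seed(0)
d = 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
prefix = torch.randn(5, d)   # 5 committed tokens
suffix = torch.randn(3, d)   # 3 newly predicted tokens appended later

out_prefix_only = causal_attention(prefix, Wq, Wk, Wv)
out_full = causal_attention(torch.cat([prefix, suffix]), Wq, Wk, Wv)
assert torch.allclose(out_prefix_only, out_full[:5], atol=1e-6)  # prefix outputs unchanged
```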

Streaming Parallel Decoding

A decoding strategy with a distance penalty that promotes left-to-right resolution, plus a dynamic sliding window that continuously refills new masks as finalized tokens are committed, eliminating the stop-and-wait bottleneck of block-wise methods.

First DLLM to Outperform Industrial AR Engines

We demonstrate that WeDLM surpasses optimized vLLM baselines in wall-clock speed, achieving around 3× speedups on complex reasoning tasks while maintaining generation quality.

Performance Overview

Performance-Speed Trade-off

Speed vs. Accuracy. WeDLM-8B (Star) achieves ~3x speedup over vLLM-optimized Qwen3-8B while maintaining high accuracy, significantly outperforming prior diffusion models.

Holistic Capability

Holistic Capability. WeDLM matches or surpasses the strong capabilities of the Qwen3-8B-Instruct baseline across math, coding, and general knowledge benchmarks.

Methodology

WeDLM introduces two key innovations to bridge the architectural gap between Diffusion Language Models and standard Causal Attention mechanisms.

01 · WeDLM Training: Causal Mask Recovery

[Figure: WeDLM Training Framework] Overview of the WeDLM training framework. Left: Topological Reordering physically shifts observed tokens to the prefix while preserving logical positions. Right: Dual-Stream Masking concatenates a clean Memory Stream with a masked Prediction Stream.

Topological Reordering

Standard Causal Attention uses a lower-triangular mask that hides future tokens from every query. WeDLM resolves this by decoupling the physical memory order from the logical position IDs; a minimal code sketch appears after the visualization below.

  • Separates logical semantics (RoPE) from physical computation.
  • Moves 'Observed' tokens to the physical start of the sequence.
  • Enables 'Noisy' tokens to attend to known futures via standard masks.
[Interactive visualization: the logical view versus the physical layout after reordering, highlighting KV-cached tokens and the active query.]
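A minimal sketch of the reordering itself, assuming a simple tensor interface (an illustration, not the official WeDLM implementation):

```python
# Minimal sketch of Topological Reordering (illustrative, not the official WeDLM code).
# Observed tokens are moved to the physical prefix while their logical position IDs
# (used for RoPE) are preserved, so a plain lower-triangular causal mask lets every
# masked position attend to all currently observed tokens.
import torch

def topological_reorder(token_ids: torch.Tensor, observed: torch.Tensor):
    """token_ids: (seq_len,) token ids; observed: (seq_len,) bool mask of known tokens.
    Returns the physically reordered tokens, their logical positions (fed to RoPE),
    and the permutation used, so outputs can be scattered back into logical order."""
    logical_pos = torch.arange(token_ids.numel())
    perm = torch.cat([logical_pos[observed], logical_pos[~observed]])  # observed first
    return token_ids[perm], logical_pos[perm], perm

# Example: positions 0, 1, 4 are observed; positions 2 and 3 are still masked.
tokens = torch.tensor([11, 12, 0, 0, 15])                 # 0 stands in for [MASK]
obs = torch.tensor([True, True, False, False, True])
phys_tokens, rope_pos, perm = topological_reorder(tokens, obs)
# phys_tokens -> [11, 12, 15, 0, 0]; rope_pos -> [0, 1, 4, 2, 3]
# Under a standard causal mask over this physical order, the masked queries
# (physical slots 3 and 4) can attend to the observed token at logical position 4,
# which a vanilla causal layout would have hidden from them.
```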
02 · WeDLM Inference: Streaming Parallel Decoding

[Figure: WeDLM Streaming Parallel Decoding] Block decoding vs. WeDLM streaming parallel decoding. Block decoding suffers from stop-and-wait; in contrast, WeDLM uses standard causal attention with a dynamic sliding window, so resolved tokens are immediately cache-ready and committed.

[Interactive animation: the streaming decoding procedure step by step (Step 1: Diffuse) over the active window.]

Dynamic Sliding Window

Traditional block-wise decoding suffers from "stop-and-wait" overhead. WeDLM instead introduces a Dynamic Sliding Window that eliminates pipeline bubbles; a simplified sketch of the decoding loop follows the list below.

  • Asynchronous Commit: Finalize tokens as soon as confident.
  • Zero Bubbles: Immediately slide window to ingest new tokens.
  • Maintains high GPU utilization, unlike block-wise methods.
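The sketch below shows a simplified version of this loop. It is written under stated assumptions: `model` is a hypothetical callable that scores `num_masked` mask slots appended after the committed prefix, and the confidence threshold and linear distance penalty are illustrative choices rather than the paper's exact schedule.

```python
# Simplified streaming parallel decoding with a dynamic sliding window
# (illustrative; `model`, conf_threshold, and the linear distance penalty are
# assumptions, not the official WeDLM implementation).
import torch

def streaming_decode(model, prompt_ids, window=16, max_new=256,
                     conf_threshold=0.9, distance_penalty=0.05):
    committed = list(prompt_ids)                 # growing left-to-right prefix (cache-friendly)
    while len(committed) - len(prompt_ids) < max_new:
        # Score `window` mask slots appended after the committed prefix.
        logits = model(torch.tensor([committed]), num_masked=window)  # (1, window, vocab)
        conf, pred = logits.softmax(-1).max(-1)                       # (1, window)
        # Distance penalty: positions farther from the prefix need higher confidence,
        # which biases resolution toward left-to-right order.
        penalized = conf[0] - distance_penalty * torch.arange(window)
        # Commit the longest confident run starting at the window's left edge;
        # committed tokens join the prefix, so their KV entries are immediately reusable.
        n_commit = 0
        while n_commit < window and penalized[n_commit] >= conf_threshold:
            n_commit += 1
        n_commit = max(n_commit, 1)              # always make progress
        committed.extend(pred[0, :n_commit].tolist())
        # The window then slides right and refills with fresh masks, so the parallel
        # workload stays fixed and there is no stop-and-wait bubble.
    return committed
```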

Experimental Results

WeDLM achieves state-of-the-art results among diffusion language models: it outperforms optimized AR engines (vLLM) in wall-clock speed while matching their generation quality, and it surpasses prior diffusion models across mathematical reasoning and code generation tasks.

Case Study: Low-Entropy Generation

The "Counting" Task

As analyzed in the paper, WeDLM achieves peak throughput on low-entropy tasks. In this counting task (1 to 200), the model resolves 8.10 tokens per forward pass on average because the pattern is highly predictable.

User: Count from 1 to 200
Assistant: Sure! Here are the numbers from 1 to 200:
1, 2, 3, 4, 5, ... [WeDLM generates strictly correct sequence] ... 199, 200.
Throughput: 1673.3 tokens/sec · Average entropy: 0.038 · 14.17 tokens per forward pass

BibTeX (Coming Soon)

📚 Want to learn more? Read our Technical Report for full details on WeDLM methodology! 🚀