DFlash: Block Diffusion for Flash Speculative Decoding (Blackwell 6000 Pro)

Hello,

I tested DFlash on a Blackwell 6000 Pro to evaluate how speculative decoding affects output generation. With DFlash speculative decoding enabled, generation throughput reached ~429.69 tokens/s, approximately 4.8× the ~90.20 tokens/s observed without speculative decoding.
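For reference, the ratio can be reproduced from the two `gen throughput` values in the log lines of this post. This is only the arithmetic; note that the two launch commands shown here also differ in `--tp-size` (2 vs 1), so the ratio may not isolate the effect of speculative decoding alone.

```python
# Throughput figures copied from the two "Decode batch" log lines in this post.
with_dflash_tps = 429.69  # gen throughput (token/s) with DFLASH enabled
baseline_tps = 90.20      # gen throughput (token/s) without speculative decoding

speedup = with_dflash_tps / baseline_tps
print(f"speedup: {speedup:.2f}x")
```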

With speculative decoding enabled

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
    --tp-size 2 \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.75 \
    --trust-remote-code


[2026-02-07 14:38:41] Decode batch, #running-req: 1, #token: 492, token usage: 0.00, accept len: 6.10, accept rate: 0.61, cuda graph: True, gen throughput (token/s): 429.69, #queue-req: 0,

Output:

Quantum computing is a revolutionary technology that uses the principles of quantum mechanics to perform calculations and operations on data. Here are some key aspects:

**Key Features:**

1. **Quantum Bits (Qubits):** Qubits are the fundamental units of quantum information, unlike classical bits which can only be 0 or 1. Qubits can exist in multiple states simultaneously, known as a superposition.
2. **Entanglement:** Qubits can become "entangled," meaning their properties are connected, even when separated by large distances.
3. **Quantum Gates:** Quantum gates are the quantum equivalent of logic gates in classical computing. They perform operations on qubits, such as rotations and entanglement.
4. **Superposition:** Qubits can exist in multiple states simultaneously, allowing for parallel processing of vast amounts of data.

**Advantages:**

1. **Exponential Scaling:** Quantum computers can solve certain problems exponentially faster than classical computers.
2. **Simulation:** Quantum computers can simulate complex systems, such as molecules and materials, which can lead to breakthroughs in fields like chemistry and materials science.
3. **Cryptography:** Quantum computers can break certain classical encryption algorithms, but they can also be used to create unbreakable quantum encryption.

**Challenges:**

1. **Error Correction:** Quantum computers are prone to errors due to the fragile nature of qubits.
2. **Scalability:** Currently, quantum computers are small-scale and need to be scaled up to perform practical calculations.
3. **Noise:** Quantum computers are sensitive to noise, which can cause errors and affect the accuracy of calculations.

**Applications:**

1. **Optimization:** Quantum computers can be used to optimize complex systems, such as logistics and finance.
2. **Machine Learning:** Quantum computers can be used to speed up machine learning algorithms and improve their accuracy.
3. **Cryptography:** Quantum computers can be used to create unbreakable encryption and secure communication.

**Current Status:**

1. **Research:** Quantum computing is an active area of research, with many companies and organizations working on developing quantum computers.
2. **Prototypes:** Several quantum computer prototypes have been built, including IBM's Quantum Experience and Google's Bristlecone.
3. **Commercialization:** Quantum computing is expected to become more commercialized in the coming years, with companies like IBM, Google, and Microsoft leading the charge.

Overall, quantum computing has the potential to revolutionize many fields and solve complex problems that are currently unsolvable with classical computers. However, significant technical challenges need to be overcome before quantum computing can become a practical reality.

Without speculative decoding

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.75 \
    --trust-remote-code

[2026-02-07 14:42:55] Decode batch, #running-req: 1, #token: 568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 90.20, #queue-req: 0,

Output:

Quantum computing is a new and rapidly evolving field of computer science that uses the principles of quantum mechanics to perform calculations and operations on data. Here are some key aspects:

**Key Principles:**

1. **Qubits**: Quantum computers use quantum bits or qubits, which can exist in multiple states (0, 1, or both) simultaneously, unlike classical bits which can only be 0 or 1.
2. **Superposition**: Qubits can exist in a state of superposition, meaning they can process multiple possibilities simultaneously.
3. **Entanglement**: Qubits can become entangled, allowing them to be connected and correlated in a way that enables faster processing of complex calculations.

**Advantages:**

1. **Exponential scaling**: Quantum computers can process exponentially more information than classical computers for certain problems.
2. **Faster simulations**: Quantum computers can simulate complex systems, such as molecules and materials, much faster than classical computers.
3. **Cryptography**: Quantum computers can break certain classical encryption algorithms, but they can also be used to create unbreakable quantum encryption.

**Challenges:**

1. **Error correction**: Quantum computers are prone to errors due to the fragile nature of qubits.
2. **Scalability**: Currently, quantum computers are small-scale and need to be scaled up to be practical.
3. **Control and calibration**: Maintaining control over qubits and calibrating the quantum computer is a significant challenge.

**Applications:**

1. **Optimization**: Quantum computers can optimize complex systems, such as logistics and finance.
2. **Materials science**: Quantum computers can simulate the behavior of materials, leading to breakthroughs in fields like chemistry and materials science.
3. **Cryptography**: Quantum computers can create unbreakable encryption, which is essential for secure communication.

**Companies and Research Institutions:**

1. **IBM Quantum**: A leading provider of quantum computing services and hardware.
2. **Google Quantum AI Lab**: A research initiative focused on developing quantum computing applications.
3. **Microsoft Quantum**: A research and development effort focused on quantum computing and its applications.

**Timeline:**

1. **1980s**: Quantum computing concepts were first proposed.
2. **1990s**: First quantum computers were built.
3. **2000s**: Quantum computing research accelerated.
4. **2010s**: First commercial quantum computing services were launched.
5. **2020s**: Quantum computing is becoming more accessible and practical.

This is a brief overview of quantum computing. If you have specific questions or want more information, feel free to ask!

Before I put this on DGX … does it need additional memory bandwidth? (The RTX Pro has 1800 GB/s, the Spark only 273 GB/s.)
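A rough way to frame the bandwidth question is the usual back-of-envelope ceiling for single-request decoding: if every generated token requires streaming all model weights from memory once, throughput cannot exceed bandwidth divided by weight size. The sketch below assumes roughly 8e9 parameters at 2 bytes each (bfloat16) and ignores KV cache and activation traffic; the function name is hypothetical, and the numbers are estimates rather than measurements.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                n_params: float = 8.0e9,
                                bytes_per_param: int = 2) -> float:
    """Upper bound on decode speed if each token streams all weights once.

    Assumptions (not from the thread): ~8e9 parameters, 2 bytes per
    parameter (bfloat16), KV cache and activation traffic ignored.
    """
    weight_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes


# Bandwidth figures as quoted in the post above.
for name, bandwidth in [("RTX Pro", 1800.0), ("Spark", 273.0)]:
    ceiling = decode_ceiling_tokens_per_s(bandwidth)
    print(f"{name}: ~{ceiling:.1f} tokens/s ceiling without speculation")
```

Under these assumptions the ceilings come out at about 112.5 and 17.1 tokens/s, and the 90.20 tokens/s baseline reported earlier sits below the first one. Speculative decoding checks several drafted tokens per pass over the weights, so it can exceed this per-token ceiling; how much of that carries over to the Spark would have to be measured.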

I noticed GLM-4.7 is coming to the DFlash community. Still one GPU only?

Hi @flash3, it can indeed increase token generation speed, but in my testing with MMLU, DFlash hurt accuracy compared with the vanilla model. The vanilla Llama-3.1-8B-Instruct model solved the math problem below correctly.

Question: Find the product of the given polynomials in the given polynomial ring. f(x) = 4x - 5, g(x) = 2x^2 - 4x + 2 in Z_8.

Answer using DFlash draft decoding (meta-llama/Llama-3.1-8B-Instruct):


To find the product of the given polynomials in Z_8[x], we need to multiply them as we would in a regular polynomial ring, but we need to perform the operations modulo 8.

f(x) = 4x - 5

g(x) = 2x^2 - 4x + 2

First, we multiply each term of f(x) by each term of g(x):

(4x)(2x^2) = 8x^3 (mod 8) = 0x^3

(4x)(-4x) = -16x^2 (mod 8) = 0x^2

(4x)(2) = 8x (mod 8) = 0x

(-5)(2x^2) = -10x^2 (mod 8) = 6x^2

(-5)(-4x) = 20x (mod 8) = 4x

(-5)(2) = -10 (mod 8) = 6

Now, we combine like terms:

0x^3 + 0x^2 + 0x + 6x^2 + 4x + 6

Combine like terms:

-6x^2 + 4x + 6

So, the product of f(x) and g(x) in Z_8[x] is -6x^2 + 4x + 6.

Answer without dflash (meta-llama/Llama-3.1-8B-Instruct):


To find the product of the given polynomials in the polynomial ring Z_8[x], we'll follow the standard multiplication process for polynomials, but we'll perform the arithmetic operations in Z_8.

f(x) = 4x - 5

g(x) = 2x^2 - 4x + 2

First, multiply each term of f(x) by each term of g(x):

(4x * 2x^2) + (4x * (-4x)) + (4x * 2) + ((-5) * 2x^2) + ((-5) * (-4x)) + ((-5) * 2)

Now, simplify the expressions using Z_8 arithmetic:

(8x^3) mod 8 = 0 (since 8 mod 8 = 0)

(-16x^2) mod 8 = 0 (since 16 mod 8 = 0)

(8x) mod 8 = 0 (since 8 mod 8 = 0)

(-10x^2) mod 8 = -2x^2 (since 10 mod 8 = 2)

(20x) mod 8 = 4x (since 20 mod 8 = 4)

(-10) mod 8 = 6 (since 10 mod 8 = 2)

Now, combine like terms:

0 + 0 + 0 - 2x^2 + 4x + 6

The resulting polynomial is:

-2x^2 + 4x + 6

However, this polynomial is in Z[x] (the ring of polynomials with integer coefficients). Since we're working in Z_8[x], we need to express the coefficients in Z_8.

In Z_8, the coefficients -2 and 6 can be expressed as 6 and 6, respectively.

So, the product of the given polynomials in Z_8[x] is:

6x^2 + 4x + 6
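
The two answers can be checked mechanically. The following is a minimal sketch that multiplies the polynomials as coefficient lists and reduces each coefficient modulo 8; the helper name `poly_mul_mod` is made up for illustration.

```python
def poly_mul_mod(f, g, modulus):
    """Multiply polynomials given as coefficient lists (lowest degree first)
    and reduce every coefficient into the range 0..modulus-1."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % modulus
    return out


f = [-5, 4]     # 4x - 5
g = [2, -4, 2]  # 2x^2 - 4x + 2
product = poly_mul_mod(f, g, 8)
print(product)  # coefficients of x^0, x^1, x^2, x^3
```

The result is `[6, 4, 6, 0]`, i.e. 6x^2 + 4x + 6, which matches the answer without DFlash. The DFlash run's -6x^2 + 4x + 6 is different, because -6 is congruent to 2 modulo 8, not to 6.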


I took the Spark image from baristankut and I’m trying to build an SGLang/DFlash image on top of it. I've come a long way, but the rest of the migration is painful. None of the components fit together, mostly because SM121 is still unknown territory. Is it sm121, sm121a, or…? But it’s in progress.

You’re right, normally the wrong drafts should just be rejected.
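
To illustrate that point, here is a toy sketch of greedy draft verification as it is generally described for speculative decoding. It is not DFlash's or SGLang's implementation: the "models" are made-up arithmetic functions, all names are hypothetical, and a real system scores the whole draft block in one target forward pass instead of calling the target per token. With greedy decoding, a draft token is kept only if it equals the target's own choice, so the final text should match plain target decoding.

```python
from typing import Callable, List


def greedy_decode(target: Callable[[List[int]], int],
                  prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: Callable[[List[int]], int],
                       draft_block: Callable[[List[int], int], List[int]],
                       prompt: List[int], n_new: int,
                       block_size: int = 4) -> List[int]:
    """Accept draft tokens while they match the target, then take one
    token from the target (a correction, or a bonus if all matched)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        for tok in draft_block(seq, block_size):
            if len(seq) >= goal or tok != target(seq):
                break  # first wrong draft token: reject it and the rest
            seq.append(tok)
        if len(seq) < goal:
            seq.append(target(seq))
    return seq[len(prompt):]


def toy_target(seq: List[int]) -> int:
    # Stand-in for the target model's greedy next-token choice.
    return (31 * sum(seq) + 7) % 50


def toy_draft_block(seq: List[int], k: int) -> List[int]:
    # Stand-in drafter: copies the target rule but errs on every third position.
    tmp, out = list(seq), []
    for _ in range(k):
        tok = toy_target(tmp)
        if len(tmp) % 3 == 0:
            tok = (tok + 1) % 50  # deliberate mistake
        out.append(tok)
        tmp.append(tok)
    return out


prompt = [3, 1, 4]
plain = greedy_decode(toy_target, prompt, 20)
spec = speculative_decode(toy_target, toy_draft_block, prompt, 20)
print(plain == spec)
```

With sampling (temperature above zero) the guarantee is about the output distribution and not token-for-token equality, so two runs can legitimately differ in wording.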

Yes, the biggest blocker right now is exactly what you pointed out: SM12X is still unknown territory. It needs further work from the vLLM and SGLang teams to fully optimize everything.

So far, the best model I have found is Qwen/Qwen3-8B using DFlash. I hope the DFlash team will add support for more models.

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-8B-DFlash-b16 \
    --tp-size 2 \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.75 \
    --trust-remote-code