Qwen3.6-27B is out!

I encountered that when I was trying to build my own harness, but not with standard OpenWebUI, so it must be bombing out on something else.

I tried with a different jinja template and didn’t experience the issue anymore.

I’m switching back to MiniMax M2.7, I’ll wait for the dust to settle to give 3.6 27B another chance, but my take right now is that the experience isn’t as polished as with M2.7. I feel 3.6 27B is a bit smarter, but the cost is sometimes massive overthinking and looping I haven’t experienced with M2.7.

Hopefully at some point the jinja template and parser will be better, but right now it doesn’t feel very good.

The preserved-thinking feature is interesting, but right now it’s a black box. While it’s good on paper, having it on the inference-engine side is weird. Like how is it structured from call to call? How much context does it use? Or is it free “alongside” context? I probably need to find documentation about it. But what’s the difference between this and managing the same feature harness-side?

How is the tool calling? Do we still need the enhanced chat template and qwen_xml, as with Qwen 3.5?

Thanks for the recipe. Happy with the results I got. But I would recommend using max_num_seqs: 30 or so, then you can run more than 4 requests at a time. Spark can handle 20-30 easily.
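To make the max_num_seqs tip concrete, here is a minimal sketch of how that cap splits incoming requests into running and queued ones. It is a simplification: it assumes max_num_seqs is the only limit, while in practice KV-cache memory and token budgets can lower the number that actually run together. The function name is made up for illustration.

```python
def split_running_waiting(num_requests: int, max_num_seqs: int) -> tuple[int, int]:
    """Return (running, waiting) for a given concurrency cap.

    Simplified model: at most max_num_seqs sequences are scheduled at once,
    and everything beyond that waits in the queue.
    """
    running = min(num_requests, max_num_seqs)
    waiting = num_requests - running
    return running, waiting


# 25 simultaneous requests with a cap of 4 versus a cap of 30.
for cap in (4, 30):
    running, waiting = split_running_waiting(25, cap)
    print(f"max_num_seqs={cap}: running={running}, waiting={waiting}")
```

With a cap of 4, most of a 25-request burst just sits in the queue, which is why raising the cap helps when the hardware has headroom.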

For my part, I’ll add a review of the weights Intel/Qwen3.6-27B-int4-AutoRound · Hugging Face
I’ve been using the updated weights for two days! The only downside is the low t/s speed, but otherwise it works perfectly!!!
I also use Minimax 2.7, but there are often errors in the code, and the suggested solutions are not the best.
On the other hand, Qwen3.6-27B writes code stably from the first time, and the functions also work perfectly!
Very satisfied with the updated Qwen3.6-27B

where can I find the script to run this benchmark on my machine?

Thanks

look at the root of his project.

I’ve used his bench and modified it to run parallel tests for my workflow.

Are you using a multi-GPU configuration with llama.cpp? If so, it does not really make much sense especially with NVLink.
This misleads people about the performance of 2 x RTX 3090 with NVLink

Impressive numbers! Could you please share the vLLM startup command and parameter configuration you used for this run?

Has anyone found a good recipe for dual Sparks, for 5-6 people working together?

My testing suggests running 2 separate instances and load-balancing across the two. I should have better results later today.
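For anyone wanting to try the two-instance approach, a minimal round-robin sketch is below. The endpoint URLs are hypothetical placeholders, and a real setup would more likely use a reverse proxy or pick the instance with the fewest in-flight requests; this only shows the basic rotation.

```python
import itertools

# Hypothetical endpoints: one independent server instance per Spark.
BACKENDS = ["http://spark-1:8000/v1", "http://spark-2:8000/v1"]


class RoundRobin:
    """Hands out backends in strict rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def next_backend(self) -> str:
        return next(self._cycle)


balancer = RoundRobin(BACKENDS)

# Four consecutive requests alternate between the two instances.
assignments = [balancer.next_backend() for _ in range(4)]
print(assignments)
```

Each client request would call next_backend() and send its chat completion to the returned base URL.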

Yes I am. My PC has 3 x RTX3090 installed, but it really does not make sense to run a small/medium model such as 27-35B on all three. I’m currently running llama-cpp with layer splitting across the GPUs (-sm layer), which works OK. I also tried ik-llama with -sm graph, which runs much more efficiently (especially WITH an NVLink, that’s why I have it installed), but the general interface of ik-llama lags behind the llama-cpp main branch in terms of server configuration.

Hi everyone,

I’m an independent developer working on a local-first, real-time inference engine (FlashRT), mainly focused on small-batch and latency-critical workloads (agents, robotics, etc.).

So far, I’ve validated strong performance in real-world setups:

  • VLA models: 2–5× faster than TRT, 10×+ vs original pipelines

  • Qwen3.6 27B (NVFP4): 100+ tok/s on a single RTX 5090 (will release soon; working on extending context support to 256k with Turboquant)

The goal is to make truly real-time local AI practical, rather than optimizing for large-batch throughput.

I’m currently expanding support to more models and would really appreciate any feedback, or let me know if you’d like to try it out:
https://github.com/LiangSu8899/FlashRT

Thanks for sharing. Curious to try it out. Would you mind sharing your recipes? Especially the Qwen3.6 27B (NVFP4) sounds interesting to try on DGX Spark. Thanks in advance. @7thuniversels

Thanks! Happy to share a bit more.

The current setup is mainly based on NVFP4 quantization + custom CUDA kernels, with a focus on reducing memory movement and kernel launch overhead (especially removing Q/DQ in the hot path). It’s designed for small-batch and real-time inference, but the same ideas translate well to higher-throughput setups.

I think it should be a good fit for DGX Spark as well, especially with its strong memory bandwidth and compute; the FP4 + fused kernel approach can scale nicely beyond just low-latency cases.

I’m currently working on improving long-context throughput, and planning to push an update for Qwen3.6 in the next couple of days on my repo. Would love for you to try it out when it’s up; any feedback would be super helpful!

New version out: unsloth/Qwen3.6-27B-NVFP4 · Hugging Face

What’s the best way to run this new unsloth version? I just tried creating a new recipe for spark-vllm-docker based on a PR. It launched but froze my DGX Spark. (A subsequent attempt showed the same runaway memory consumption.)

Set gpu_memory_utilization: 0.60 or so. With NVFP4 the model size calculation comes out negative (-xxx), and then it tries to allocate the memory you set minus that -xxx. So in the end it allocates the 0.6 you set + xxx * 2 because of the calculation bug, you run out of memory, and the Spark freezes. Learned it the hard way, like you :D
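The arithmetic described above can be sketched as follows. This only models the behaviour as reported in this thread, not the actual engine code. The 128 GiB total is the Spark's unified memory, and the 20 GiB weight size is an assumed placeholder standing in for "xxx".

```python
def total_allocated_gib(total_gib, utilization, weights_gib, sign_bug=True):
    """Memory that ends up allocated under the behaviour described above.

    budget   = total * utilization      (what you asked for)
    reported = -weights if the size is logged with the wrong sign
    kv_cache = budget - reported        (the "remaining" memory)
    actual   = real weights + kv_cache
    """
    budget = total_gib * utilization
    reported = -weights_gib if sign_bug else weights_gib
    kv_cache = budget - reported
    return weights_gib + kv_cache


TOTAL_GIB = 128.0    # DGX Spark unified memory
WEIGHTS_GIB = 20.0   # assumed placeholder for the real weight size ("xxx")

for util in (0.9, 0.6):
    buggy = total_allocated_gib(TOTAL_GIB, util, WEIGHTS_GIB, sign_bug=True)
    expected = total_allocated_gib(TOTAL_GIB, util, WEIGHTS_GIB, sign_bug=False)
    verdict = "exceeds memory" if buggy > TOTAL_GIB else "fits"
    print(f"util={util}: intended {expected:.1f} GiB, buggy {buggy:.1f} GiB ({verdict})")
```

Under these assumptions the buggy total is always the intended budget plus twice the weight size, so a lower utilization setting leaves room for that overshoot.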

I went through two attempts at getting the Unsloth NVFP4 quant running just before I saw your note here. Both failed miserably with endless OOMs until the box eventually killed nearly everything and let me log in again.

How in the world did you figure out the memory math bug? Is it referenced elsewhere?

Just common logic :D It says allocated memory -XXX. That is a signal that the math is failing and the KV cache will try to allocate whatever remains :)