I encountered that when I was trying to build my own harness, but not with standard OpenWebUI, so it would be bombing out on something else.
I tried a different jinja template and didn't experience the issue anymore.
I'm switching back to MiniMax M2.7. I'll wait for the dust to settle before giving 3.6 27B another chance, but my take right now is that the experience isn't as polished as with M2.7. I feel 3.6 27B is a bit smarter, but the cost is sometimes massive overthinking and looping that I haven't experienced with M2.7.
Hopefully at some point the jinja template and parser will be better, but right now it doesn't feel very good.
The preserved-thinking feature is interesting, but right now it's a black box. While it's good on paper, having it on the inference-engine side is weird. How is it structured from call to call? How much context does it use? Or is it free, "alongside" the context? I probably need to find documentation about it. But what's the difference between this and managing the same feature harness-side?
How is tool calling? Do we still need the enhanced chat template and qwen_xml, as with Qwen 3.5?
For my part, I'll add a review of the weights: Intel/Qwen3.6-27B-int4-AutoRound · Hugging Face
I've been using the updated weights for two days! The only downside is the low t/s speed, but otherwise it works perfectly!!!
I also use MiniMax M2.7, but its code often has errors, and the solutions it suggests are not the best.
On the other hand, Qwen3.6-27B reliably writes working code on the first try, and the functions also work perfectly!
Very satisfied with the updated Qwen3.6-27B.
Where can I find the script to run this benchmark on my machine?
Thanks
Look at the root of his project.
I've taken his bench and modified it to run parallel tests for my workflow.
Are you using a multi-GPU configuration with llama.cpp? If so, it does not really make much sense, especially with NVLink.
This misleads people about the performance of 2 x RTX 3090 with NVLink
Impressive numbers! Could you please share the vLLM startup command and parameter configuration you used for this run?
Has anyone found a good recipe for dual Sparks that 5-6 people can use at the same time?
My testing so far suggests running 2 separate instances and load-balancing across the two. I should have better results later today.
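For anyone wanting to try the two-instance approach, here is a minimal round-robin sketch in Python. The backend URLs and the `RoundRobinBalancer` name are hypothetical placeholders, not taken from the setup described above, and in practice a reverse proxy in front of the two servers would probably do the same job.

```python
import itertools
import threading


class RoundRobinBalancer:
    """Hands out backend URLs in strict rotation (thread-safe)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._cycle = itertools.cycle(self._backends)
        self._lock = threading.Lock()

    def pick(self):
        # The lock keeps the rotation consistent when several users
        # send requests at the same time.
        with self._lock:
            return next(self._cycle)


# Hypothetical addresses: one inference server per Spark.
BACKENDS = ["http://spark-1:8000/v1", "http://spark-2:8000/v1"]

balancer = RoundRobinBalancer(BACKENDS)
first_four = [balancer.pick() for _ in range(4)]
print(first_four)
```

Each client request would call `balancer.pick()` and send its OpenAI-style request to the returned base URL. Plain rotation ignores how long each request runs, so a least-busy policy may work better when some users send much longer prompts than others.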
Yes, I am. My PC has 3 x RTX 3090 installed, but it really does not make sense to run a small/medium model of 27-35B on all three. I'm currently running llama.cpp with the layer split mode across GPUs (-sm layer), which works OK. I also tried ik-llama with -sm graph, which runs much more efficiently (especially WITH NVLink, which is why I have it installed), but the general interface of ik-llama lags behind the llama.cpp main branch in terms of server configuration.
Hi everyone,
I'm an independent developer working on a local-first, real-time inference engine (FlashRT), mainly focused on small-batch and latency-critical workloads (agents, robotics, etc.).
So far, I've validated strong performance in real-world setups:
- VLA models: 2–5× faster than TRT, 10×+ vs original pipelines
- Qwen3.6 27B (NVFP4): 100+ tok/s on a single RTX 5090 (will release soon; working on extending context support to 256k with Turboquant)
The goal is to make truly real-time local AI practical, rather than optimizing for large-batch throughput.
I'm currently expanding support to more models. I would really appreciate any feedback, or let me know if you'd like to try it out:
https://github.com/LiangSu8899/FlashRT
Thanks for sharing. Curious to try it out. Would you mind sharing your recipes? Especially the Qwen3.6 27B (NVFP4) sounds interesting to try on DGX Spark. Thanks in advance. @7thuniversels
Thanks! Happy to share a bit more.
The current setup is mainly based on NVFP4 quantization + custom CUDA kernels, with a focus on reducing memory movement and kernel launch overhead (especially removing Q/DQ in the hot path). It's designed for small-batch and real-time inference, but the same ideas translate well to higher-throughput setups.
I think it should be a good fit for DGX Spark as well: especially with its strong memory bandwidth and compute, the FP4 + fused kernel approach can scale nicely beyond just low-latency cases.
I'm currently working on improving long-context throughput, and planning to push an update for Qwen3.6 in the next couple of days on my repo. Would love for you to try it out when it's up. Any feedback would be super helpful!
New version out: unsloth/Qwen3.6-27B-NVFP4 · Hugging Face
What's the best way to run this new unsloth version? I just tried creating a new recipe for spark-vllm-docker based on a PR. It launched but froze my DGX Spark. (A subsequent attempt showed the same runaway memory consumption.)
Set gpu_memory_utilization: 0.60 or something like that. With NVFP4 the size calculation comes out as -xxx, and it then tries to allocate the memory you set minus that -xxx. So, because of the calculation bug, what ends up allocated is the 0.6 you set + xxx * 2, you run out of memory, and the Spark freezes. Learned it the hard way, like you :D
I went through two attempts at getting the Unsloth NVFP4 quant running just before I saw your note here. Both failed miserably with endless OOMs until the box eventually killed nearly everything and let me log in again.
How in the world did you figure out the memory math bug? Is it referenced elsewhere?
Just common logic :D The log says allocated memory is -XXX. That's a signal that the math is failing and that the KV cache will try to allocate whatever it thinks is left :)
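One way to read the explanation above is sketched below in Python. It is only a toy model of the described behaviour, not code from vLLM: the function name, the sign-flip assumption, and the 20 GB weight size are all illustrative placeholders. The only hardware figure used is the DGX Spark's 128 GB of unified memory.

```python
def projected_usage_gb(total_gb, utilization, weights_gb, size_sign_bug):
    """Return (kv_cache_gb, total_used_gb) under the assumed accounting."""
    budget_gb = total_gb * utilization
    # Assumption: the bug reports the weight size with the wrong sign (-xxx).
    reported_weights_gb = -weights_gb if size_sign_bug else weights_gb
    # The KV cache gets whatever the engine believes is left of the budget.
    kv_cache_gb = budget_gb - reported_weights_gb
    # Real usage is the real weights plus the KV cache allocation.
    return kv_cache_gb, weights_gb + kv_cache_gb


TOTAL_GB = 128.0   # DGX Spark unified memory
WEIGHTS_GB = 20.0  # illustrative placeholder, not a measured model size

for util in (0.90, 0.60):
    for bug in (False, True):
        kv, used = projected_usage_gb(TOTAL_GB, util, WEIGHTS_GB, bug)
        status = "fits" if used <= TOTAL_GB else "exceeds memory"
        print(f"util={util:.2f} bug={bug}: kv={kv:.1f} GB, "
              f"used={used:.1f} GB ({status})")
```

Under this reading, the total comes out to the configured budget plus twice the weight size, which matches the "0.6 you set + xxx * 2" description. With these placeholder numbers a 0.90 setting projects past 128 GB, while 0.60 leaves some headroom.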
