Qwen3.5-397B-A17B runs on a dual Spark! But I have a concern

Intel released Qwen3.5-397B-A17B-int4-AutoRound on Hugging Face, and it runs well with vLLM with max-model-len set to 262144. It averages 26 tok/s on a single request, while Qwen3.5-122B-A10B-FP8 outputs around 31 tok/s.

Since it won’t be used by just one person but by 2–3 people, I decided to use vLLM for serving. Currently, I am debating whether to use Qwen3.5-397B-A17B-int4-AutoRound or Qwen3.5-122B-A10B-FP8. Which one would be more advantageous in terms of quality?

I cannot find quantitative evaluations, so I am not sure which one to choose.


Wow, 26 tok/s on a dual cluster! That’s amazing. That’s what I got for Qwen 235B on a dual cluster. I would prefer the 397B over the 122B.

Quality: 397B is better. But with int4-AutoRound you could instead run two separate single nodes with 122B each, which gives much more scalability (also around 30 tok/s on a single node).

26 t/s with 17B active parameters is pretty good, but what is the largest context window this setup affords you? Does it work well for you?

First, when I set the max model length to 262,144, it loaded normally and was able to serve the API even without the --enforce-eager flag.

When two people used it at the same time, it produced an average of 42–46 tok/s.
When I used it alone, it generated around 26–30 tok/s, and the results were quite good.
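
For anyone wanting to reproduce those tok/s numbers, here is a minimal sketch of how you might measure throughput against vLLM's OpenAI-compatible endpoint. The URL and model name are placeholders for this setup, and the figure includes prefill time, so it slightly understates pure decode speed:

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: completion tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return completion_tokens / elapsed_s

def measure(base_url: str, model: str, prompt: str) -> float:
    """Time one non-streaming chat completion and compute tok/s from its usage block."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_second(usage["completion_tokens"], time.monotonic() - start)

# Example (needs a running vLLM server; URL and model name are placeholders):
#   print(measure("http://localhost:8000", "Qwen3.5-397B-A17B-int4-AutoRound",
#                 "Write a haiku about GPUs."))
```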

However, after looking a bit into the inference process and reviewing the code, I keep feeling that Qwen3.5-122B-A10B-FP8 is actually better. I suspect Qwen3.5-397B-A17B-int4-AutoRound may have a significant quality gap due to the 4-bit quantization.
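
To make the 4-bit concern concrete, here is a toy round-trip through naive symmetric per-tensor quantization. This is not AutoRound's actual algorithm (which calibrates rounding per channel and is much better), but it illustrates how much coarser a 4-bit grid is than an 8-bit one:

```python
def quant_roundtrip(x: list[float], bits: int) -> list[float]:
    """Symmetric per-tensor quantize/dequantize: snap values to a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 127 for int8
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

def max_abs_error(x: list[float], bits: int) -> float:
    """Worst-case reconstruction error after the round trip."""
    return max(abs(a - b) for a, b in zip(x, quant_roundtrip(x, bits)))

weights = [0.93, -0.41, 0.07, 0.555, -0.88, 0.12]
# The int4 error is considerably larger than the int8 error on the same tensor
print(max_abs_error(weights, 4), max_abs_error(weights, 8))
```

Real int4 quants close most of this gap with per-group scales and calibration, which is why benchmark numbers (rather than intuition) are needed to settle the 397B-int4 vs 122B-FP8 question.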

I am currently continuing to use it for code development in Unity, and honestly, I really like Qwen3.5. It feels like having a very smart subordinate. The agent also works very well.


Thanks for your feedback. Indeed, we should run capability benchmarks whenever we select different quants, as this has a significant impact on model performance, especially at 4 bits.

What flags did you use to get this working?

I think I got the settings right, but it seems to be taking 6+ hours to start inference. Is this normal for int4?

You’re most likely stuck on CUDA graphs.

Not exactly sure how OP got it running, but with the latest vLLM build it currently does not work.

Yup, mine doesn’t work either; I wasted tons of time. Hope someone gets it running stably and shares how…

I’m going to download and test it later - just need to clean up storage on my Sparks first :)

@Balaxxe @Icisu
Hello. If you build with just the latest vLLM, the model will fail to load. My method and flags are as follows.

When building the vLLM Docker image, you must not only build from the main branch but also enable PRE_TRANSFORMERS during the build.

Because:
If you look at tokenizer_config.json for Qwen3.5-397B-A17B-int4-AutoRound, the tokenizer_class is set to TokenizersBackend, which (as far as I know) is only supported from Transformers version 5 onward. If you don’t enable PRE_TRANSFORMERS, my understanding is that Transformers 4.x gets installed, so the model will not load properly.
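
You can check this yourself before building. A small sketch that inspects tokenizer_config.json from a downloaded model directory (the path in the example is a placeholder for my setup):

```python
import json
from pathlib import Path

def needs_transformers_v5(tokenizer_config: dict) -> bool:
    """True if the tokenizer class requires Transformers >= 5 (to my knowledge,
    TokenizersBackend is only available from version 5 onward)."""
    return tokenizer_config.get("tokenizer_class") == "TokenizersBackend"

# Example usage against a downloaded model directory (path is a placeholder):
#   cfg = json.loads(Path("/workspace/Model/Qwen3.5-397B-A17B-int4-AutoRound/"
#                         "tokenizer_config.json").read_text())
#   print(needs_transformers_v5(cfg))
```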

And even if you enable PRE_TRANSFORMERS, you will still get an error when loading the model. To fix this, I referenced the following section from QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face:

TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE='            ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"

Based on that, I created a patch command (it runs only on Transformers 5.2.0 or later).
You can use it by simply copying and pasting, but it comes with a disclaimer: I am not responsible for any issues that occur as a result.

[patch command]

TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py"
echo "Target: $TF_FILE"

# Backup
cp -v "$TF_FILE" "${TF_FILE}.orig"

python - "$TF_FILE" <<'PY'
from pathlib import Path
import sys

# Patch the file located by the shell step above (passed in as argv[1])
path = Path(sys.argv[1])
text = path.read_text(encoding="utf-8").splitlines(True)

patch_line = '            ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}\n'

# Exit early if the patch has already been applied
if any(patch_line.strip() == line.strip() for line in text):
    print("Already patched")
    sys.exit(0)

# Insert right after the ignore_keys_at_rope_validation initialization
# inside the partial_rotary_factor block
inserted = False
for i in range(len(text) - 5):
    if "partial_rotary_factor" in text[i] and "kwargs.get" in text[i]:
        # Look below for the line that opens "ignore_keys_at_rope_validation = ("
        for j in range(i, min(i + 80, len(text))):
            if "ignore_keys_at_rope_validation" in text[j] and "= (" in text[j]:
                # Find the closing ")" and insert patch_line on the next line
                for k in range(j, min(j + 15, len(text))):
                    if text[k].strip() == ")":
                        text.insert(k + 1, patch_line)
                        inserted = True
                        break
                if inserted:
                    break
        if inserted:
            break

if not inserted:
    print("Patch target not found (file structure differs)")
    print('Run: grep -n "partial_rotary_factor"', path)
    sys.exit(1)

path.write_text("".join(text), encoding="utf-8")
print("Patch complete")
PY

# Verify the inserted line is present
grep -n 'ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' "$TF_FILE" || true

[/patch command]

With the build done using pre-transformers and the patch applied, the model run flags are as follows:

vllm serve /workspace/Model/Qwen3.5-397B-A17B-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 262144 \
  --max-num-seqs 100
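
Once it is up, a quick sanity check against the OpenAI-compatible API confirms the model is actually registered (the endpoint below is a placeholder matching the --host/--port flags above):

```python
import json
import urllib.request

def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]

# Example against a running server (endpoint matches the serve flags above):
#   with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
#       print(served_model_ids(json.load(resp)))
```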

Additionally, if you are using Linux in GUI mode, there is a very high probability that a deadlock will occur while loading the model.
Therefore, it is strongly recommended to switch the system to console mode using:
sudo systemctl set-default multi-user.target
After switching to multi-user (non-GUI) mode, you should manage the server via SSH from another PC.

The following command switches the system back to graphical (GUI) mode:
sudo systemctl isolate graphical.target

Thanks, I assumed as much. Our community Dockerfile handles all of this for us with the pre-TF flag/image.

You should check it out: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

However, I think a recent vLLM commit may have broken functionality.

Curious if the latest build of vllm would still work for you.

Thanks for that patch info - maybe @eugr can incorporate this if he gets a chance to vet it out (he maintains the community repo).

It is already a part of the repo, just use --apply-mod mods/fix-qwen3.5-autoround as an argument to the launch-cluster.sh.

I guess I need to update the changelog - was too busy and didn’t do that when pushing this patch.


I finally managed to run the 397B on a dual Spark: 30 tok/s output speed with max-num-seqs 16, 128k sequence length, and prefix caching. It barely fits on the Sparks. Unlike what I mentioned earlier, it only takes 20 minutes to load the model.

I used the patch above; here are the flags I used:

--host 0.0.0.0
--port 8000
--tensor-parallel-size 2
--distributed-executor-backend ray
--trust-remote-code
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--max-model-len 131072
--max-num-seqs 16
--enable-prefix-caching
--max-num-batched-tokens 4096
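
For a rough feel of why a 128k context "barely fits": the KV cache grows linearly with cached tokens. A back-of-the-envelope sketch; the layer/head/dim numbers below are purely illustrative placeholders, not the real Qwen3.5-397B config:

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) * layers * kv_heads * head_dim
    * bytes per element, for each cached token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token_bytes / 1024**3

# Illustrative numbers only (NOT the real Qwen3.5-397B config):
# 60 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
print(f"{kv_cache_gib(131072, 60, 8, 128):.1f} GiB for one full 128k sequence")
# → 30.0 GiB for one full 128k sequence
```

With --kv-cache-dtype fp8 (as in the 4x-cluster flags further down), bytes_per_elem drops to 1 and the figure halves.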

use --load-format fastsafetensors if you want faster load times.

Official INT4 versions are out, like this one: Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 · Hugging Face. Could anyone give it a try?

GPTQ is an older quantization format; AWQ and AutoRound should give better accuracy. So it depends on how Qwen produced this one. If they used something like QAT, it could be a good alternative. Unfortunately, they don’t provide any details on this quant in the description.


We are working on getting max performance out of a 4x cluster on Qwen3.5-397B.

So far we have not been able to get the model online: TP=4 hits a Marlin sharding issue (in_proj_ba fused layer).
We are not focusing on TP=2 at the moment, as the goal is to get TP=4 and TP=8 working.

I have no clue how you were able to run this. Every time I try to run the 397B model on a Spark cluster, I get OOM on node 1. I have 4.1 GB of RAM utilization before starting eugr’s spark-vllm (latest version) in an SSH session, and as advised I have turned off graphical login. Any suggestions? I used exactly your flags; the only change is that I am using fastsafetensors.

cd ~/spark-vllm-docker

./launch-cluster.sh -t vllm-node-tf5 \
  --eth-if enp1s0f0np0 \
  --ib-if rocep1s0f0,roceP2p1s0f0 \
  --apply-mod mods/fix-qwen3.5-autoround \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 262144 \
  --max-num-seqs 16 \
  --enable-prefix-caching \
  --max-num-batched-tokens 4176 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8

Try the above - tweak --eth-if/--ib-if to your setup (I use the other ConnectX-7 port).

Use the latest builds from Eugr’s repo.

Should work.
