@Balaxxe @Icisu
Hello. If you simply build the latest vLLM, this model will fail to load. My build method and run flags are as follows.
When building the vLLM Docker image, you must not only build from the main branch, but also enable PRE_TRANSFORMERS during the build.
Because:
If you look at tokenizer_config.json for Qwen3.5-397B-A17B-int4-AutoRound, the tokenizer_class is set to TokenizersBackend, which (as far as I know) is supported only from Transformers 5 onwards. If you don't enable PRE_TRANSFORMERS, Transformers 4.x will be installed (as I understand it), so the model will not load properly.
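You can confirm this yourself before building anything by reading the tokenizer_class field. A minimal sketch (the JSON string below contains only the relevant field, not the model's full config file):

```python
import json

# Minimal sketch: pull tokenizer_class out of a tokenizer_config.json.
# The sample string is just the one relevant field, not the real file.
def tokenizer_class(config_text: str) -> str:
    return json.loads(config_text).get("tokenizer_class", "")

sample = '{"tokenizer_class": "TokenizersBackend"}'
print(tokenizer_class(sample))  # TokenizersBackend
```

If this prints TokenizersBackend for your local copy of the config, you need the Transformers 5 line.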
Even with PRE_TRANSFORMERS enabled, you will still get an error when loading the model. To fix this, I referenced the following section from “QuantTrio/Qwen3.5-122B-A10B-AWQ · Hugging Face”:
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && echo "$TF_FILE"
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
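For context, the inserted statement just widens the set of config keys that RoPE validation ignores, so partial_rotary_factor no longer trips the load error. (Note that the perl one-liner above overwrites a hard-coded line number, 651, which only matches that exact Transformers version.) A standalone illustration with made-up tuple contents:

```python
# Illustration only: these tuple contents are made up; in
# modeling_rope_utils.py the variable holds the config keys that
# RoPE validation skips.
ignore_keys_at_rope_validation = ("rope_type", "factor")

# The patched line converts the tuple to a set and adds
# partial_rotary_factor, so that key no longer fails validation.
ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}
print("partial_rotary_factor" in ignore_keys_at_rope_validation)  # True
```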
Based on that, I created a patch command. (It runs only on Transformers 5.2.0 or later.)
You can use the following patch command by simply copying and pasting it, but it comes with a disclaimer: I am not responsible for any issues that occur as a result.
[patch command]
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py"
echo "Target: $TF_FILE"
# Back up the original file
cp -v "$TF_FILE" "${TF_FILE}.orig"
python - "$TF_FILE" <<'PY'
from pathlib import Path
import sys

# The path is passed in from the shell, so this works regardless of
# where pip installed transformers
path = Path(sys.argv[1])
text = path.read_text(encoding="utf-8").splitlines(True)
patch_stmt = 'ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}'

# Exit if the patch has already been applied
if any(patch_stmt == line.strip() for line in text):
    print("Already patched")
    sys.exit(0)

# Inside the partial_rotary_factor block, insert the patch right after
# ignore_keys_at_rope_validation is initialized
inserted = False
for i in range(len(text) - 5):
    if "partial_rotary_factor" in text[i] and "kwargs.get" in text[i]:
        # Below that, look for `ignore_keys_at_rope_validation = (`
        for j in range(i, min(i + 80, len(text))):
            if "ignore_keys_at_rope_validation" in text[j] and "= (" in text[j]:
                # Find the closing parenthesis ')' and insert the patch on
                # the next line, reusing the assignment's own indentation
                indent = text[j][: len(text[j]) - len(text[j].lstrip())]
                for k in range(j, min(j + 15, len(text))):
                    if text[k].strip() == ")":
                        text.insert(k + 1, indent + patch_stmt + "\n")
                        inserted = True
                        break
            if inserted:
                break
    if inserted:
        break

if not inserted:
    print("Patch target not found (file structure differs)")
    print('Run: grep -n "partial_rotary_factor"', path)
    sys.exit(1)

path.write_text("".join(text), encoding="utf-8")
print("Patch complete")
PY
# Verify that the inserted line is present
grep -n 'ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' "$TF_FILE" || true
[/patch command]
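If the patch ever causes trouble, the .orig backup made by the cp step above makes rollback a single copy. A self-contained sketch of that rollback (the helper name is mine; the demo uses a temp file as a stand-in for modeling_rope_utils.py):

```python
import os
import shutil
import tempfile

def rollback(path: str) -> None:
    """Restore the <path>.orig backup created before patching over <path>."""
    shutil.copyfile(path + ".orig", path)

# Self-contained demo on a temp file standing in for the real module
d = tempfile.mkdtemp()
target = os.path.join(d, "modeling_rope_utils.py")
with open(target + ".orig", "w") as f:
    f.write("original\n")
with open(target, "w") as f:
    f.write("patched\n")
rollback(target)
print(open(target).read(), end="")  # prints: original
```

On the real file the same thing is just `cp "${TF_FILE}.orig" "$TF_FILE"`.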
With the image built with PRE_TRANSFORMERS enabled and the patch applied, the model run flags are as follows:
vllm serve /workspace/Model/Qwen3.5-397B-A17B-int4-AutoRound \
--host 0.0.0.0 --port 8000 \
--distributed-executor-backend ray \
--trust-remote-code \
--tensor-parallel-size 2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--max-model-len 262144 \
--max-num-seqs 100
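Once the server is up, a quick sanity check is to list the served models via the OpenAI-compatible endpoint. A small sketch (host and port taken from the flags above; the helper name is mine):

```python
import json
import urllib.request

def model_ids(payload: dict) -> list:
    """Pull the model IDs out of a /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

# Query the server started above; falls through if it is not up yet
try:
    with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=5) as resp:
        print(model_ids(json.load(resp)))
except OSError:
    print("server not reachable yet")
```

Seeing the model path in the list confirms the load succeeded end to end.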
Additionally, if you are using Linux in GUI mode, there is a very high probability that a deadlock will occur while loading the model.
Therefore, it is strongly recommended to switch the system to console mode using:
sudo systemctl set-default multi-user.target
After switching to multi-user (non-GUI) mode, you should manage the server via SSH from another PC.
The following command switches the system back to graphical (GUI) mode:
sudo systemctl isolate graphical.target