I’m hitting a FilesBufferOnDevice exception when loading the Intel AutoRound quant of Qwen 3.5 397B on a dual Spark setup. I know others have this running successfully, so I’m trying to figure out if my local files are fundamentally different or if it’s my loader config.
The specific error is:
`Exception: FilesBufferOnDevice: key model.language_model.layers.59.mlp.experts.213.down_proj.qweight must be unique among files`
A clean download shows a direct conflict between the index and the binary shards. `model.safetensors.index.json` maps the key to `model-00038-of-00040.safetensors`, but a grep through the actual shards shows it is physically duplicated in shards 39 and 40.
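For anyone who wants to cross-check their own download without grepping binary files, here is a minimal sketch. It assumes the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the usual `weight_map` field in the index. The function names are my own, not part of any loader.

```python
import json
import struct
import sys
from collections import defaultdict
from pathlib import Path


def shard_keys(path):
    """Return the tensor names in one .safetensors file, reading only its header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor
    return [k for k in header if k != "__metadata__"]


def find_conflicts(model_dir):
    """Compare model.safetensors.index.json against what the shards contain.

    Returns (duplicates, mismatches):
      duplicates: key -> list of shards that all contain it (more than one)
      mismatches: key -> (shard named by the index, shards actually holding it)
    """
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    weight_map = index["weight_map"]

    found = defaultdict(list)
    for shard in sorted(model_dir.glob("*.safetensors")):
        for key in shard_keys(shard):
            found[key].append(shard.name)

    duplicates = {k: v for k, v in found.items() if len(v) > 1}
    mismatches = {
        k: (weight_map.get(k), v)
        for k, v in found.items()
        if weight_map.get(k) not in v
    }
    return duplicates, mismatches


if __name__ == "__main__":
    dups, mism = find_conflicts(sys.argv[1])
    for key, shards in dups.items():
        print("duplicate:", key, shards)
    for key, (indexed, actual) in mism.items():
        print("index mismatch:", key, "index says", indexed, "found in", actual)
```

Run it with the model directory as the only argument. If it prints nothing, the shards and the index agree, and the problem is more likely in the loader configuration than in the files.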
It seems that using `--tf` instead of `--pre-tf`, or copying the recipe flags, got me past that error. At this point, though, I'm stuck with Ray taking up memory (the worker gets OOM-killed). It seems I need to switch over to the recipe format so I can drop Ray for now and see how that goes instead.
I’ve tried using `--no-ray`, but I still get a memory error at initial startup saying that 111.xxGB is not enough to satisfy a 112GB request. I have next to nothing running on the Sparks, and I disabled X11 (switched to multi-user.target) as part of the initial setup. The only thing running beyond the usual system services is screen, to multiplex my one SSH session.
May I ask which GB10 you have? I found that ASUS reserves more memory than, e.g., the NVIDIA Spark or Lenovo.
For example, the NVIDIA Spark in my setup has 122 GB of free memory and 2.4 GB in use with nothing running in user space. The ASUS has 120 GB of free memory and 4.1 GB in use!
Because of this, I was forced to lower the KV cache memory and reduce the max context size.
What is your overall free memory and how much memory is used by the nodes?
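To compare nodes on the same basis, here is a small sketch that reads `/proc/meminfo` and checks whether a given request would fit. The helper names and the 1 GiB safety margin are my own assumptions. `max_fraction` only applies if your loader requests a fixed fraction of total memory, as vLLM does with `--gpu-memory-utilization`. Values are in GiB, which may not match the units printed in the loader's error message.

```python
def read_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> GiB."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        # Only size fields carry a "kB" unit; counters such as HugePages_Total do not
        if len(fields) == 2 and fields[1] == "kB":
            info[name.strip()] = int(fields[0]) / (1024 ** 2)  # kB -> GiB
    return info


def fits(request_gib, available_gib):
    """True if the requested amount is no larger than what is available."""
    return request_gib <= available_gib


def max_fraction(available_gib, total_gib, margin_gib=1.0):
    """Largest fraction of total memory that fits in what is available,
    leaving margin_gib of headroom (the margin is an arbitrary assumption)."""
    return max(0.0, (available_gib - margin_gib) / total_gib)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        mem = read_meminfo(f.read())
    total, avail = mem["MemTotal"], mem["MemAvailable"]
    print(f"total {total:.1f} GiB, available {avail:.1f} GiB")
    print(f"112 GiB request fits: {fits(112, avail)}")
    print(f"max fraction with 1 GiB margin: {max_fraction(avail, total):.3f}")
```

Running this on each node right after a reboot gives comparable numbers for how much the firmware and the base OS take before anything is loaded.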
Unfortunately, I was not able to do much.
Mine, after a fresh reboot with nothing running:
               total        used        free      shared  buff/cache   available
Mem:           119Gi       3.6Gi       106Gi       160Ki        10Gi       115Gi
You could uninstall a lot of Ubuntu packages such as cups, snapd, etc. This helps a little. In the end, I have to start qwen3.5-397B with 106 GB of memory and a reduced context size of 200k.
Otherwise I get the same error as you.
I guess you noticed (firmware update), but the ~2GB difference comes from how much is reserved in the UEFI firmware. NVIDIA reduced it from 4GB to 2GB (I think; I'm going from memory), so the ~2GB gain hit the Founders Edition first and eventually followed through to the partner models. FYI.