The main issues are: compute and memory feeding can't be separated well, because all of the computing units are fed from the same memory (and its limited bandwidth).
Everything that gets distributed or shared between nodes has to be serialized, so in principle it takes longer than on a single node.
NCCL is very fast, but if a remote procedure is called 90 times at 120 µs each, 10.8 ms are gone. So this operation is capped at roughly 92 executions per second, and any further compute time lowers that ceiling.
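The latency budget above can be checked with back-of-the-envelope arithmetic (the 90 calls and 120 µs are the numbers from the post; the 5 ms compute time is a hypothetical addition for illustration):

```python
# Back-of-the-envelope: throughput ceiling imposed by remote-call latency.
rpc_latency_s = 120e-6      # ~120 µs per remote call (e.g. one NCCL collective)
calls_per_op = 90           # remote calls needed per operation

latency_per_op_s = rpc_latency_s * calls_per_op    # 10.8 ms lost to latency
ceiling_per_s = 1 / latency_per_op_s               # ~92.6 ops/s upper bound

print(f"{latency_per_op_s * 1e3:.1f} ms per op, ceiling ~{ceiling_per_s:.0f}/s")

# Any additional compute time only lowers the ceiling further:
compute_s = 5e-3            # hypothetical 5 ms of compute per operation
effective_per_s = 1 / (latency_per_op_s + compute_s)
print(f"with 5 ms compute: ~{effective_per_s:.0f}/s")
```

Note that the ceiling is purely latency-bound: it holds even if the GPUs themselves are idle, which is why stacking more compute behind the same interconnect doesn't help.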
vLLM is full of this. vLLM is a framework for setting up and running local models (GPUs on the same host). Some extensions try to scale to other GPUs, and further still over the network, but in the end it's not orchestrated well. It's not made for DGX.
DGX is a complementary design to every CUDA/AI approach that came before it.
The first users are trying to sell their DGX because they couldn't figure out how to use it. The dead-horse marketing (nvp4) doesn't help either.
Just to give you a specific vision: you know that memory is limited and shared between GPU and CPU. So you run the Linux system with X and a browser on it. Why not have a YouTube video running while you're doing some AI research?
swap-laboratories/moe-configs at main · vedcsolution/swap-laboratories · GitHub: some MoE configurations for Intel_Qwen3.5-122B-A10B-int4-AutoRound, Qwen_Qwen3.5-122B-A10B-FP8, and DGX Spark. The tuned ones ran in Ray; I don't know if they need to be updated for the new launch method, `--distributed-executor-backend external_launcher`.
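For reference, a hedged sketch of what the new launch method looks like, not verified against these configs: with the `external_launcher` backend, vLLM's tensor-parallel workers are spawned by an external launcher such as torchrun instead of by Ray. The script name `infer_tp2.py` is a placeholder for your own offline-inference script.

```shell
# Sketch only: spawn 2 tensor-parallel ranks on one host via torchrun.
# infer_tp2.py would construct the engine itself, roughly:
#   LLM(model="...", tensor_parallel_size=2,
#       distributed_executor_backend="external_launcher")
torchrun --nproc-per-node=2 infer_tp2.py
```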
tp=2 is already faster than tp=1 without modding vLLM. It depends on the model, of course: models with a small number of active parameters don't scale as well as large dense models, but we've been using tensor parallelism to improve inference speeds on Sparks since November…
Sure, it depends. Tensor parallelism roughly halves the per-token compute time, and that gain competes with the NCCL latency and wait cycles added on top; mostly it compensates, and sometimes, when fully saturated, the time won offsets the wait time.