Been out of the forum for a couple of weeks, and I humbly apologize for not keeping up with it. I tried going through the threads and am still confused, so here goes: what’s the current status of NVFP4 support on the Spark? Is it at least as fast as FP8, with a lower memory footprint?
It works fine for me. I think there is some disillusionment about it, partly because of how int4 models perform compared to nvfp4 ones, but I think that’s largely explained by the fact that many “nvfp4” models actually keep huge chunks of the network at bf16, so it’s not truly an apples-to-apples comparison.
At least, that was my initial frustration.
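If you want to verify that on a given checkpoint rather than take the model card’s word for it, a quick byte-count by dtype does the job. This is just a minimal sketch: `model.safetensors` is a placeholder path, and on NVFP4 exports the packed 4-bit weights typically show up as uint8 tensors (two values per byte) alongside whatever was left at bf16.

```python
# Tally stored bytes per dtype to see how much of an "nvfp4" checkpoint
# is actually quantized. "model.safetensors" is a placeholder path.
from collections import Counter
from safetensors import safe_open

bytes_by_dtype = Counter()
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)  # loads each tensor; fine for a one-off audit
        bytes_by_dtype[str(t.dtype)] += t.numel() * t.element_size()

total = sum(bytes_by_dtype.values())
for dtype, nbytes in bytes_by_dtype.most_common():
    print(f"{dtype:>14}: {nbytes / total:6.1%} of checkpoint bytes")
```

If a big share of the bytes is still torch.bfloat16, the footprint (and the speed) won’t look anything like a fully quantized model.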
Kernel-wise, everything works in vLLM. The two key flaws that really screwed up how sm120/121 operated have been resolved. There was the “illegal instruction” issue caused by consumer Blackwell’s lack of tcgen05 support, but that was fixed in CUDA 12.9 and now emulates cleanly (though I’m not sure how performant the emulation is). And then there was the issue of SMEM being smaller on the consumer-grade cards (99 KB vs 224 KB) and kernels not adapting to that limitation. That also seems to be fixed now across TRT-LLM, vLLM, CUTLASS, etc.
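For reference, you can read that SMEM ceiling straight off the device. This assumes a recent PyTorch that exposes these cudaDeviceProp fields; older builds may not surface them, in which case cudaDeviceGetAttribute from the CUDA runtime is the fallback.

```python
import torch

# Shared-memory limits a kernel has to fit under. Field names assume a
# recent PyTorch that surfaces these cudaDeviceProp entries.
props = torch.cuda.get_device_properties(0)
print(f"device: {props.name} (sm_{props.major}{props.minor})")
print(f"smem per SM:             {props.shared_memory_per_multiprocessor // 1024} KB")
print(f"smem per block (opt-in): {props.shared_memory_per_block_optin // 1024} KB")
```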
There’s also the b12x kernel now, first-party from NVIDIA (well, written by an NVIDIA engineer), which improves performance, and I’m sure more improvements are on the way. The reality is that the Spark is a nice little machine, and it’s the cheapest way to get this much VRAM in a CUDA ecosystem, but the physics of memory bandwidth remain somewhat of a limitation. I’d love to see a NUMA version where the same GPU also had access to 16GB of faster VRAM, but tbh the Spark is still great, and for me personally it facilitates a ton of things I wouldn’t otherwise be able to do.
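To put napkin math on the bandwidth point: during decode, every generated token has to stream essentially all the weights through the memory bus once, so tokens/sec can’t exceed bandwidth divided by model bytes. The ~273 GB/s figure is the commonly quoted Spark spec, and the model size is illustrative; plug in your own numbers.

```python
# Upper bound on decode throughput: bandwidth / bytes of weights read per token.
# 273 GB/s is the commonly quoted Spark LPDDR5x spec -- treat it as an assumption.
BANDWIDTH_GBS = 273.0

def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float) -> float:
    model_gb = params_billion * bytes_per_param
    return BANDWIDTH_GBS / model_gb

# nvfp4 ~ 4 bits per value + an FP8 scale per 16-value block = 4.5 bits/param
for label, bpp in [("bf16", 2.0), ("fp8", 1.0), ("nvfp4 (~4.5 bits)", 0.5625)]:
    print(f"70B @ {label:<18}: ~{decode_ceiling_tok_s(70, bpp):4.1f} tok/s ceiling")
```

Real decode rates land below these ceilings once KV-cache and activation traffic are counted, which is exactly why the smaller NVFP4 footprint matters so much on this box.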
There’s also the d12 kernel now
d12? b12x?
whoops!