NVFP4 on DGX Spark / GB10 is broken. I bought 9 of these for this feature. Requesting NVIDIA's official roadmap and response

I am posting this as a paying customer with 9× DGX Spark / GB10 nodes in production (~$38k invested), asking NVIDIA for an on-record response on the state of NVFP4 on this hardware. I want a reply from someone empowered to speak to the DGX Spark product roadmap, not another community comment, please.

I bought this hardware specifically for NVFP4. The software to make that usable is not there. This post documents, with primary-source citations only, what NVIDIA promised, what the community has measured, what the community has fixed on its own, and the complete absence of any badged NVIDIA-staff response addressing the gap.


My deployment

  • 9× GB10 DGX Spark / OEM equivalents, ~$4,000 each
    • Cluster of 8 + single node
  • 2× Mikrotik CRS804 fabric, ConnectX-7 on every node
  • Head node running SGLang, serving GLM-5.1 (754B / 40B active) in FP8 for a 24/7 agentic coding workload (FP8 because NVFP4 does not work without workarounds, and those are still suboptimal)

FP8 serving works. NVFP4 does not. That is the entire premise of this post.


What NVIDIA promised about NVFP4 on GB10 — verbatim quotes from NVIDIA’s own materials

DGX Spark hardware datasheet (Hardware Overview, DGX Spark User Guide)

“Up to 1,000 TOPS (trillion operations per second) inference and up to 1 PFLOP (petaFLOP) at FP4 precision with sparsity”
“NVIDIA Blackwell Architecture with 5th Generation Tensor Cores”

DGX Spark product page (NVIDIA DGX Spark: AI Supercomputer on Your Desk)

“up to one petaFLOP of FP4 AI performance”

Nemotron-3-Super-120B-A12B-NVFP4 model card (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · Hugging Face)

“Minimum GPU Requirement: 1× B200 OR 1× DGX Spark”
Deployment section includes: “vLLM on DGX Spark: To deploy the NVFP4 checkpoint on NVIDIA DGX Spark…”
Published benchmark hardware on the card: H100, H200, GB200. No GB10 / DGX Spark numbers are published anywhere.

Nemotron-3-Super announcement blog (Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog)

“4x on NVIDIA B200 compared to FP8 on NVIDIA H100”

The headline NVFP4 speedup figure is measured on B200 — not on the GB10 hardware NVIDIA lists as a supported deployment target.

GTC 2026 NemoClaw blog (RTX PCs and DGX Spark Supercomputers Run AI Agents Locally | NVIDIA Blog)

“Nemotron 3 Super is optimal for powering agents on the DGX Spark or NVIDIA RTX PRO workstations.”

NVIDIA’s own most recent marketing directs customers to run its flagship NVFP4-native model on DGX Spark while publishing zero GB10 benchmarks and delivering a software stack that does not exercise the FP4 tensor cores the hardware was sold on.


What actually runs on GB10 — NVFP4 measurements posted to NVIDIA’s own developer forum

Llama-3.3-70B-Instruct-NVFP4 on TensorRT-LLM (NVIDIA’s own flagship NVFP4 model, on NVIDIA’s own first-party inference stack):

The second thread documents that vanilla GGUF Q4_K_M via LM Studio runs the same 70B model at 4.6–4.9 tok/s on the same Spark — NVIDIA’s NVFP4 model on NVIDIA’s TRT-LLM is slower than a non-NVIDIA quant on non-NVIDIA tooling.

Nemotron-3-Super-120B-A12B-NVFP4 (NVIDIA’s other flagship NVFP4 model, explicitly named on the model card as deployable on DGX Spark):

Why 19–22 tok/s is unambiguously bad on this hardware

Nemotron-3-Super has 12B active parameters per forward pass. At NVFP4 (0.5 bytes per parameter), each decoded token reads ~6 GB of active weights from memory. GB10 has 273 GB/s of LPDDR5x bandwidth (per NVIDIA’s own datasheet).

  • Theoretical bandwidth-limited ceiling: 273 ÷ 6 = ~45 tok/s for this model on this silicon, even if every other overhead were zero.
  • Measured: 19–22 tok/s = 42–48% of the bandwidth ceiling.
    • Even at 200GB/s across dual Mikrotik CRS804 switches that’s around 34 tok/s. I’m getting almost 200GB/s with a raw cluster test.
  • What a reasonable well-optimized NVFP4 path should deliver on this hardware: ~30–40 tok/s (60–80% bandwidth efficiency is routine on GB10 in other configurations).
  • Put plainly by a community member on the Hugging Face model-card discussion: “This is a model with 12B activated parameters per token. It should generate at least 30 t/s.” (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · All this talk about NVFP4 - why is it dog slow?)

The hardware is leaving roughly half its achievable throughput on the floor on NVIDIA’s own NVFP4-native flagship model. This is not a memory-bandwidth limitation. This is a kernel / software-stack limitation. The FP4 tensor cores NVIDIA marketed are not being exercised effectively.
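For anyone who wants to check the arithmetic above, here is a minimal back-of-envelope sketch. It assumes decode is purely weight-streaming bound (every active weight is read once per token) and ignores KV-cache reads, activations, and quantization scale factors, so it is an upper bound, not a prediction. The function name `decode_ceiling_tok_s` is my own; the inputs are the figures quoted in this post.

```python
def decode_ceiling_tok_s(active_params_b: float,
                         bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode tok/s if each token streams all active weights once.

    active_params_b: active parameters per token, in billions
    bytes_per_param: storage per weight (0.5 for a 4-bit format, scales ignored)
    bandwidth_gb_s:  memory bandwidth in GB/s
    """
    gb_per_token = active_params_b * bytes_per_param  # billions of params * bytes = GB
    return bandwidth_gb_s / gb_per_token


# Figures from the post: 12B active params, NVFP4 at 0.5 bytes/param, 273 GB/s.
ceiling = decode_ceiling_tok_s(12, 0.5, 273)

# Measured range quoted in the post, as a fraction of that ceiling.
low, high = 19 / ceiling, 22 / ceiling

print(f"ceiling: {ceiling:.1f} tok/s")
print(f"measured 19-22 tok/s is {low:.0%} to {high:.0%} of the ceiling")
```

Changing `bandwidth_gb_s` or `bytes_per_param` makes it easy to see how sensitive the ceiling is to either assumption.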

The Hugging Face discussion thread on the model card is titled “All this talk about NVFP4 - why is it dog slow?” (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). NVIDIA has posted no substantive reply.


Direct A/B comparisons on identical models: NVFP4 loses on the hardware it was built for

Same model. Same hardware. Same framework. Different precision. Community-reported numbers with URLs.

Qwen3-Next-80B-A3B-Instruct, single Spark, vLLM (thread: “Qwen3-Next AWQ 4bit vs FP8 vs NVFP4 on single spark”)

| Precision | Decode tok/s |
|-----------|--------------|
| FP8       | 44.56        |
| NVFP4     | 39.54        |
| AWQ 4-bit | 32.82        |

FP8 beats NVFP4 by about 13% in decode throughput on the same hardware, even though NVFP4 weights occupy roughly half the memory of FP8.

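Qwen3-VL-235B, 2× Spark, vLLM (thread: “PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM”)

| Precision | Decode tok/s (1 req) | Decode tok/s (10 concurrent) |
|-----------|----------------------|------------------------------|
| AWQ 4-bit | 24.93                | 42.11                        |
| NVFP4     | 18.91                | 35.58                        |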

AWQ beats NVFP4 by 18–32% on the precision NVIDIA has been most aggressively marketing.

MiniMax-M2.7, 2× Spark, vLLM (thread: “MiniMax M2.7 NFVP4 Recipe & Benchmarks”)

| Precision                                   | Decode tok/s |
|---------------------------------------------|--------------|
| AWQ 4-bit                                   | 39.39        |
| NVFP4 (FlashInfer-cutlass, fully optimized) | 25.69        |

AWQ beats NVFP4 by 53%.

Nemotron-3-Nano-30B-A3B-NVFP4, single Spark, vLLM (thread: “Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark)”)

| NVFP4 backend                            | Decode tok/s | GPU memory    |
|------------------------------------------|--------------|---------------|
| Default (FlashInfer, as shipped by NVIDIA) | 42.6       | 39 GB         |
| Community Marlin patch                   | 50.0 (+16%)  | 32 GB (−7 GB) |

The default NVFP4 path NVIDIA ships costs 16% throughput and wastes 7 GB of GPU memory compared to a community-built patch.

The pattern is unambiguous: on GB10, NVFP4 is currently slower than FP8, slower than AWQ, and slower than community-patched NVFP4 using non-default backends. The headline format of this hardware is the worst practical format option on it.
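The percentages in this section can be recomputed from the tables with a few lines of Python. This is only a sketch over the community-reported decode numbers quoted above; `pct_faster` and the row labels are my own names, and nothing here is a new measurement.

```python
def pct_faster(fast: float, slow: float) -> float:
    """How much faster `fast` is than `slow`, in percent of `slow`."""
    return (fast / slow - 1.0) * 100.0


# (label, faster tok/s, slower tok/s), copied from the tables in this post.
results = [
    ("Qwen3-Next-80B: FP8 vs NVFP4", 44.56, 39.54),
    ("Qwen3-VL-235B: AWQ vs NVFP4 (1 req)", 24.93, 18.91),
    ("Qwen3-VL-235B: AWQ vs NVFP4 (10 concurrent)", 42.11, 35.58),
    ("MiniMax-M2.7: AWQ vs NVFP4", 39.39, 25.69),
    ("Nemotron-3-Nano: Marlin patch vs default NVFP4", 50.0, 42.6),
]

for label, fast, slow in results:
    print(f"{label}: +{pct_faster(fast, slow):.1f}%")
```

Computed this way, the Marlin row comes out slightly above the +16% shown in the table, which presumably reflects rounding in the originally reported figures.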


The community has documented the underlying issues exhaustively

Threads on NVIDIA’s own developer forum — all customer-opened, none with a badged-NVIDIA-staff resolution:

Community engineering work that NVIDIA has not adopted:

GitHub-side issue tracking:

The customer community has built and shipped the patches. NVIDIA has not adopted them or provided official equivalents, let alone published a roadmap or reassured its customers.


The asymmetry

After reading every thread and GitHub issue linked above, I have not found a single response from a verifiable, badged NVIDIA staff member that addresses any of the following:

  1. A roadmap or timeline for SM121 NVFP4 software support reaching parity with SM100
  2. An official acknowledgement of the dense (non-sparse) FP4 peak on GB10 — the datasheet headline is “1 PFLOP at FP4 precision with sparsity”; the dense number is not published with equivalent prominence
  3. A commitment to upstream the community SM121 patches (PR #37700, Marlin NVFP4 backend, SM121 CUTLASS grouped-GEMM work) into NVIDIA’s first-party container images
  4. Clarification of which FP4 MMA functionality SM121 silicon actually implements vs what is software-disabled — community reads of CUTLASS source suggest hardware-level gaps; NVIDIA documentation does not confirm or deny
  5. An ETA on nvcr.io/nvidia/vllm images with native SM121 FP4 paths enabled by default

If an official response exists that I missed, please reply here with a link.


What I am asking for, specifically

I want, on-record, from someone empowered to speak for the DGX Spark product team or the CUTLASS / TensorRT-LLM engineering groups:

  1. Is SM121 NVFP4 parity with SM100 on the roadmap? Yes, no, or partial. If no, say so plainly so customers can architect around the limitation.

  2. Publish the dense FP4 peak on GB10. The sparsity-qualified “1 PFLOP” number is in the datasheet. Publish the dense number with equivalent prominence. Customers deserve to be able to compare like-for-like.

  3. Commit to upstreaming the community SM121 fixes — PR #37700, the Marlin NVFP4 backend, the SM121 CUTLASS grouped-GEMM work — into NVIDIA’s first-party container images, or ship official equivalents. Customers should not be the ones patching CUTLASS and FlashInfer to get NVFP4 to work on hardware NVIDIA sold for NVFP4.

     • Seriously… I had expected to load up models and spend my time doing real work, not deal with stability issues on a box that was marketed as working.
  4. Clarify the silicon. One paragraph from NVIDIA on which FP4 MMA functionality SM121 hardware implements vs what is software-disabled resolves the central technical question of this entire debate.

  5. Publish Nemotron-3-Super-120B-A12B-NVFP4 numbers on DGX Spark. Your own GTC 2026 blog calls DGX Spark “optimal” for this model. The model card lists DGX Spark as a supported deployment target. On a single Spark it runs at 19–22 tok/s, roughly half the bandwidth-limited ceiling for a 12B-active model on 273 GB/s memory. Either fix the kernels so the measured number approaches the achievable, or correct the marketing.


Why this matters beyond me

NVFP4 was the headline feature of this hardware. The model releases (Nemotron-3-Nano, Nemotron-3-Super) were built around it. The marketing (“5th Generation Tensor Cores”, “1 petaFLOP FP4”, “optimal for DGX Spark”) was built around it. I and many others bought into this platform specifically because NVIDIA positioned NVFP4 on GB10 as a first-class path.

On measured reality, NVFP4 on GB10 is slower than FP8, slower than AWQ, and slower than community-patched NVFP4. The fix appears to exist in community patches but is not in shipping NVIDIA software. The asymmetry between how much customer engineering work is on public record here and how little NVIDIA engagement is on public record here is not sustainable for a product NVIDIA continues to actively market under the Grace Blackwell brand.

NVIDIA built and sold this product with NVFP4 as the headline. Honor it please!

I can provide additional benchmarks, traces, and a repro environment from my fleet if that helps an NVIDIA engineer engage substantively. I would prefer a public reply on this thread so other customers can plan around the answer.

PLEASE, FIX THIS! And please, give us some answers and commitments…the hardware is not cheap!

Thank you.

Watch for my post in about 10 minutes on PrismQuant.

The reason you see NVFP4 sucking so hard compared to Int4 is because when something gets quantized to Int4 (using autoround), ALL THE LAYERS are in Int4. When something gets quantized to NVFP4, probably half stay in bf16 because there’s too much concern about quality degradation.

I’m about to fix all that.
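To put a rough number on that argument: if decode is bandwidth bound, what matters is the average bytes read per weight, not the name of the format. The sketch below assumes 0.5 bytes per 4-bit weight and 2 bytes per bf16 weight, ignores scale factors, and treats the “probably half” figure above as a guess rather than a measurement. `avg_bytes_per_param` is a hypothetical helper name.

```python
def avg_bytes_per_param(frac_bf16: float,
                        quant_bytes: float = 0.5,
                        bf16_bytes: float = 2.0) -> float:
    """Average storage per weight when a fraction of weights stays in bf16."""
    return frac_bf16 * bf16_bytes + (1.0 - frac_bf16) * quant_bytes


# Every layer quantized to 4 bits (the Int4 / AutoRound case described above).
all_4bit = avg_bytes_per_param(0.0)

# Half the weights left in bf16 (the guess for a typical NVFP4 checkpoint).
half_bf16 = avg_bytes_per_param(0.5)

# Ratio of bytes streamed per token between the two checkpoints.
traffic_ratio = half_bf16 / all_4bit

print(f"all 4-bit:  {all_4bit} bytes/param")
print(f"half bf16:  {half_bf16} bytes/param")
print(f"weight traffic ratio: {traffic_ratio}x")
```

Under those assumptions the mixed checkpoint moves several times more weight bytes per token, which on its own would explain a large decode gap regardless of kernel quality.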

After the last few vllm/flashinfer releases, NVFP4 works fine. Idk if 37700 has been merged yet (I wrote it), but I don’t think it’s going to massively change things.

Hear, hear. This was my first post to this forum 2 months ago. We have 2x GB10 DGX Sparks. In this whole time I never saw a single engagement or forum response from Nvidia on this subject. The only help I got was from @eugr on spark-vllm-docker, @flash3 on setting up int4 AutoRound bfloat16, and countless other community members. Not a single bit of help from Nvidia. I never see them publicly praise this community for all the heavy lifting it does for them. It leaves me with a bitter aftertaste, and not the nice kind.

Nvidia: I would remind you of the copious research on technology diffusion. We are your visionaries: a deeper-pocketed, technically oriented, tightly bound, self-referential community that is spontaneously emerging. You need us in your market development to convince early adopters this technology works. Right now it doesn’t, and outwardly it feels like you are ignoring us. This never goes well. Maybe you don’t care. I do. In this day and age your behaviour is frankly contemptuous.


Just to be really clear. This is not a criticism of the forum staff. They appear to be very responsive and actively supporting users with technical problems. For whatever reason the subject of NVFP4 support on the DGX Spark appears outside their purview.

In my opinion, if the community is able to solve the problem—even partially—but the company fails to do so, then the reasons for its inaction are not technical in nature.

I’m not sure I totally agree with the original author, but I’m guessing you totally missed his point. He bought 9x DGX Sparks for production use because Nvidia called it a Super Computer with full Blackwell compatibility, with all the top numbers in the PR. And we are all chasing optimizations with CUDA, PyTorch, Speculative Decoding, Quantization, etc.

This isn’t something new for Nvidia; they do that ALL the time. 40x0 on Laptop and Desktop have different capabilities, GPU cores, performance, etc. When is an RTX 5000 not an RTX 5000 series?

They don’t even seem to really know what their boxes are capable of, so somebody really needs to walk up to Jensen Huang at GTC or some other event and tell him that to his face.

I bought the box for Unsloth Studio testing, but I will NOT recommend this to anybody.

This is my biggest complaint of the platform. I expected exactly this out-of-the-box. Choose a new model release (in NVFP4 if available!!), load it up, use it. Not hours and days and even weeks of debugging. If it wasn’t for @eugr, @dbsci and a few others basically doing Nvidia’s work for them, I would have already sold my units.

Never try to beat the laws of nature.

Nine times zero is still zero.

To be fair, it’s not actually zero — it’s that much-discussed, absolute bare minimum of an idea of a data center … thing.

NVFP4 is a dead horse. Everyone knows it.
Why pitch a bare-minimum box flogging a dead horse? A kind of discipline? Consistency, all the way down?

It’s all been said, sadly. The fact of the matter is this: DGX Spark enthusiasts/adopters are not important enough to NVIDIA for them to make the effort.

The company is too flush, lacks enough viable competition and is developing the kind of hubris that tends to eventually bite one in the backside.

In the meantime, we can hope for scraps from the table — or continue to support the community who have contributed more to making this a viable product than its manufacturer.

I know well-intentioned folks from NVIDIA will pop into a thread here or there and suggest support is coming. But I’m pretty convinced at this point it is not going to happen.

Commotion on Olympus

The god of envy had been tinkering for over a year on a small golden box that somehow nobody really wanted to love. His color was already fading, and the other gods were starting to worry. Is he ill? Where has the green gone?

Mars was the first to approach him. “Brother, what’s the matter? You promised a petaflop.” — “With sparsity,” he muttered. “With sparsity I had promised it.”

Minerva the Wise leafed through the parchments scattered around his workbench. Benchmarks. Forum posts. Something called SM121. She furrowed her brow. “A mortal writes here that he bought nine of your boxes. Nine. And he does not sound envious. He sounds … disappointed.”

The god of envy flinched. Disappointment was not his department. Disappointment belonged to Spes, the goddess of hope, and precisely at the moment when she failed. Not to him.

Vulcanus came out of his forge, sooty and sweating, and took a longer look at the box. Then a second one. He unscrewed it, shone a light inside, ran his sooty fingers across the silicon. “It says right here in big letters: Fifth Generation Tensor Cores.” Pause. “But I can’t find them. At least not the ones that are in the other box, the big one. There’s something else in here. Smaller. Less. I don’t quite know what.” — “Then why does it say so on the outside?” asked Ceres. Vulcanus shrugged. “Ask marketing. Not my craft.”

Mercurius, the swift messenger, landed in a flutter beside them. “I was just in the forum. The mortals are writing. A lot. They’re patching it themselves. One is called tenari, one flash3, one eugr. They are fixing your box, brother. They are fixing it for you.”

He raised his head. “And what am I doing?” — “You are silent,” said Mercurius. “For months.”

An uncomfortable silence fell over the workshop. Bacchus, who happened to wander by — as he always happens to wander by when things get uncomfortable — suggested writing a blog. “Something with optimal for. That always sounds good.” Minerva looked at him. “You did that last month. And you never published the benchmarks on the box itself.” Bacchus nodded. “True. I was busy.”

The god of envy stared at his hands. They were pale. Almost grayish. The green — where had the green gone?

“Perhaps,” said Janus quietly, who had two faces and therefore knew both sides, “perhaps you are not ill at all, brother. Perhaps you are simply no longer the one they named you after.”

Olympus fell silent.

Six months on this thread with no badged reply, so I figure it’s worth putting the alternatives in
one place — a few folks have asked in DM.

If NVFP4 support on GB10 was part of why you bought the hardware, and the current state doesn’t match
what you expected, most jurisdictions have a consumer-protection body that handles exactly this kind
of gap. Filing a report isn’t making any legal claim against anyone; it just puts the issue on a
desk that has authority to look into it. One report rarely does much. Fifty reports do.

  • US: reportfraud.ftc.gov, or your state Attorney General’s consumer-protection office.
  • EU: your national consumer authority, or ECC-Net for cross-border purchases.
  • UK: the CMA, or Trading Standards via Citizens Advice.
  • Canada: the Competition Bureau.
  • Australia: the ACCC.
  • Taiwan: the Fair Trade Commission at ftc.gov.tw, under Fair Trade Act §21 (misleading
    representation).

If you’re still inside your card’s chargeback window, that’s usually the fastest route. “Goods or
services not as described” is one of the standard reason codes for this kind of situation — you work
it out with your bank, not with any court.

Regardless of which path anyone takes — including continuing to wait — it’s worth saving evidence now
while it’s still easy: invoice, dated screenshots of the DGX Spark / GB10 product pages (archive.org
if the live pages have shifted), any docs or posts that referenced NVFP4 support on GB10, and a link
to this thread with timestamps. Getting this stuff later is much harder.

Not advocating any particular course of action, and none of this is legal advice. Just making sure
the option set is visible.

These kinds of posts really irritate me, and I’m certain they make Nvidia’s people less likely to engage with us here.

Please consider:

  • We now have a completely operational NVFP4 path on GB10 (new within the last month) even on inference frameworks Nvidia doesn’t control; this is the default on the community Docker. So claims that it doesn’t work at all would never pass minimal scrutiny.
  • This has required a bunch of PRs across many open repositories, mostly by Nvidia employees, so there is no basis for claims of neglect.
  • Flashinfer 0.6.8.1 has further improvements merged in just a couple days ago.
  • While 1PFlop is advertised, understand this is specified as sparse in the official materials and the memory bandwidth has always been clear.

Trying to make such claims would have only one guaranteed outcome: enriching your lawyers on your behalf.

I recommend constructive engagement, not this.

I just wanted to add my 2 cents to the OP.

My employer also purchased 8 Sparks, which I’ve had the good fortune to have at home to test and ‘figure things out’ before they go to the non-dev teams. I too am confused how one of the richest, if not THE richest, companies in the world (I honestly don’t track that; it doesn’t matter to me), at least in the IT sector, put out hardware before its software stack could live up to its claims. It would be one thing if they were on the cusp of a fix.

But as you very precisely point out, this is something Nvidia has been drumming up for a long time. I would bet that many of the end users here in the forums (myself included) are not hardcore AI engineers or super dev wizards, and many spent their own hard-earned money on an expensive AI “Supercomputer”. $4-5K isn’t a lot of money for some of these folks, but for others I’m sure it was a big purchase. The people in here, I think, fall into 3 classes of users: 1. Home users, probably disappointed with performance, partly from misunderstanding what the Sparks are really for and partly because the marketing did not meet expectations. 2. Power users (myself) who don’t really know AI and its inner workings well but know computers well and catch on quickly. We depend on the generosity of people like Eugr and MANY others who spend their own time (time always = money) building fixes, patches, updates, etc., so that the rest of us can use their optimized/patched backends to run LLMs and get decent speeds.

We know their work is truly professional quality, since NVIDIA has been merging some of their fixes into PRs that eventually find their way into the Spark’s official codebase. Without them, group 2 (where I am) would be lost, probably running things in terrible configurations with poor speeds. The last group, group 3, are the Eugrs of the world, the true wizards who understand this stuff inside and out, and they are essentially driving the hardware forward through their code! I’ve seen comments here and there from NVIDIA staff or moderators or whatever they are (I don’t know if they are on payroll or not), and they are always praising that work and helping get it into the official codebase.

Things such as sparkrun and spark arena, and the custom GitHub repos many of these wizards produce, make it possible for someone like me to run a model in vLLM and get great performance. Otherwise I would be searching the web for how to get things to run better. The playbooks, which at first I thought were designed to be high grade and usable long term, turned out to be just minimal examples to get things running on the various setups. I do thank Nvidia for these playbooks; they did get me started. I wish they were a bit deeper and explained more. But maybe I’m expecting too much.

I remember when I got the Spark: the hardware warranty was directly through Nvidia, and for ‘all software related issues’ we were told to come here. This forum has been a tremendous resource because of the kind nature of everyone and the patience for the ignorant (like me!) on advanced AI topics. The true heroes, though, are those putting out these custom fixes. Nvidia officially said to come here for software support, yet here we all are with no official thread where Nvidia is actively providing regular updates on the progress of NVFP4. The fact they haven’t responded to this thread saddens me.

As paying customers, and businesses, here we are in the software forums, the ONLY place to get software support. Yet we are not being updated regularly. I shouldn’t have to hunt in some random post to see if Nvidia responded on it. There should be a pinned post at the top of the forums saying ‘here’s what we are doing’, and Nvidia should own it and continue to update us.

My company is considering purchasing another, larger batch of these. So I feel our voices count; companies like the one that employs me are what is causing these things to sell like crazy. I’m not here to bash Nvidia, nor am I going to go to any lawyers or anything like that. I’m just here to agree it would be nice to see official NVIDIA staff keeping us updated. That’s all. I was excited when I read and learned about NVFP4 and its potential, and also that Nvidia was putting out their own models. I figured, oh man, this has to be the gold standard in terms of the Spark’s capabilities. I have since learned this is not the case.

So I do kindly, with all due respect, ask that Nvidia be more up front about what’s going on. People appreciate updates even if it’s not “fixed”. I pushed for my company to get these based on what Nvidia said it could deliver. Are the Sparks amazing hardware? Yes! Absolutely! Has the code caught up? No! We have a major AI push in our workplace. There’s money on the table. Would I recommend Sparks for some use cases? Maybe, maybe not. I would have to be up front that I am relying on the work of individuals outside of Nvidia to deliver the best performance. That may or may not go over well. Their contributions are what keep the innovation moving forward. Heck, if I were someone with power at Nvidia, I would gift a Spark to certain people who have made major contributions, whose generosity and time are finding their way into official Spark updates/patches. But that’s just me.

I love the Spark; in fact I want a second one to unlock some of the bigger models. But my employer isn’t going to allow me to have two at home for testing. Though I’m at the point that when they ask for it back, I’ll have to buy a Spark on my own, because it has been life changing for my IT career. I was kind of hoping for a cheaper version (without the ConnectX-7s) for those who never plan on having two, which could save them a lot of money, or a “version 2” of the Spark with improvements, but from the little research I’ve done that doesn’t seem to be on the roadmap. So, thanks to the tireless efforts of contributors here, I am going to purchase one eventually. I hope when that time comes Nvidia’s software stack has caught up to its hardware.

So to the OP: you are not alone. A lot of us, from newbies like me to experienced pros, are disappointed in how Nvidia has handled this situation. All I’m doing is tossing my hat into the ring to say that I share the same frustrations, hoping to nudge Nvidia to step it up. They have the resources and manpower to get it done, regardless of whether the software is open source, third party, and has a lot of moving parts. You are a tech titan corporation. There’s really no excuse. You have all the money and talent at your fingertips. The priority isn’t there, perhaps because the Spark is treated as a consumer device rather than a full-on enterprise-grade system. With close to $40K that my employer has dropped on these, I feel we have paid enough for our voice to at least be heard and acknowledged.

I mean no disrespect, I’m not trying to start an argument with anyone, others may not agree with my post. That’s ok. My personal opinion is the more paying customers add to this thread in a constructive non-hostile way, the better.

Cheers.

Sorry you are irritated. However, if you bought the hardware in October like I did you have been waiting six months for a promised feature to be realized.

This is a company that earned over $200B in revenue in the last fiscal year; they are not hurting for resources. They can fix this problem. It’s that this development is not being prioritized. Perhaps shame on us for thinking there was a real commitment to anything other than huge data center products.

One small nit from your comment – OK, so VLLM is a project they don’t control. What’s the excuse for TensorRT-LLM being basically non-viable on the platform?

I don’t think stating this opinion is anything less than respectful and constructive. It’s simply the truth.

Nvidia releases their own playbooks, quants of models, and even Docker images of vLLM. Not having control of vLLM doesn’t really make sense to me, because it’s open source.

But you’re right though. We’re a small fry of a customer base next to their US$200B revenue. The individual salary of a full-time senior software engineer at Nvidia can be anywhere from 300K to upwards of 600K. It makes me wonder whether they think dedicating even one software engineer to Spark is worth it for their bottom line.

Just hire someone from these forums or something /shrug

Oh, I may have missed the point. It’s not the device you bought that makes others envy you — it’s the salary Nvidia pays (its workers) that makes you envy them. A new perspective.

But honestly, take a look at the M5. Although the compute isn’t CUDA yet, things like drafters work like hell because… the attentive audience has already noticed: nothing is more memory-bandwidth-bound than the GB10. And yes — it feels so good to be allowed to write this down here. Thanks.

It’s the GTX 970 “4GB” VRAM situation again.

Ironically I just got an email about the new DGX Stations with GB300.. and then saw the same specs of crap LPDDR5X memory in them.. oh no.

GB300 has loads of HBM3 memory in it too, and even LPDDR5x bandwidth is higher there. It’s a completely different class of hardware though.

Don’t want to derail this thread, so maybe we start another, but I looked through multiple sales posts that stated it as all just unified 768GB LPDDR5x memory @ 396 GB/s. The sales posts didn’t mention anything else. This was from a number of manufacturers including Dell, Gigabyte, MSI and at least 3 or 4 others. If it’s something other than that, it’s really bad marketing.

Edit: Ok, I’m seeing some posts breaking it down into low-speed and HBM3e memory, but a lot aren’t. Makes things really confusing, as I was immediately turned off by the posts that turned up for me first.

It’s in their marketing materials though. But yeah, don’t want to derail the thread either, and again, this is a completely different class of hardware with completely different costs.

From NVIDIA DGX Station: Ultimate AI Supercomputer

NVIDIA GPU: 1x NVIDIA Blackwell Ultra
NVIDIA CPU: 1x Grace 72-Core Neoverse V2
GPU Memory: 252 GB HBM3e | 7.1 TB/s
CPU Memory: 496 GB LPDDR5X | 396 GB/s
NVLink-C2C: 900 GB/s