SSD Expert Streaming

This project enables a MacBook Pro with 48 GB of RAM to run Qwen3.5-397B-A17B at 4.4+ tokens/second.
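A back-of-envelope check of that figure. The 4-bit quantization and the fraction of active weights cached in RAM below are illustrative assumptions, not details from the project:

```python
# Rough upper bound on decode speed when part of the active expert weights
# must be streamed from SSD for every token. All inputs here are assumed
# values for illustration, not measurements.

def tokens_per_second(ssd_gb_s, active_params_b, bits_per_param, cached_fraction):
    """Tokens/s bound by reading the uncached active parameters per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_param / 8
    streamed_bytes = bytes_per_token * (1 - cached_fraction)
    return ssd_gb_s * 1e9 / streamed_bytes

# 17B active params at 4-bit, with half the active weights resident in RAM,
# streaming the rest over a 17.5 GB/s SSD:
print(round(tokens_per_second(17.5, 17, 4, 0.5), 1))  # -> 4.1
```

Under those assumptions the arithmetic lands in the same ballpark as the quoted 4.4 tokens/second, which suggests the number is plausible rather than confirming the project's exact method.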

The M3 Max can do 17.5 GB/s. Is the internal SSD throughput on the DGX Spark fast enough to implement this approach?

I believe the DGX Spark external USB is 20GB/s, but throughput needs to be tested.

Update 1: PCIe 5.0 x4 allows for a theoretical bandwidth of up to 16 GB/s, with practical drive performance pushing over 14 GB/s, so I think this approach might be worth a try.
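For what it's worth, the 16 GB/s figure is the raw x4 link rate; accounting for PCIe 5.0's 128b/130b line coding gives a slightly lower payload ceiling, consistent with drives topping out a bit above 14 GB/s:

```python
# PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding, so the usable
# payload rate per lane is 32e9 * (128/130) / 8 bytes/s.
per_lane = 32e9 * 128 / 130 / 8   # ~3.94 GB/s per lane after encoding
x4 = 4 * per_lane                 # ~15.75 GB/s for an x4 link
print(round(x4 / 1e9, 2))         # -> 15.75
```

Protocol overhead (TLP headers, flow control) takes another slice on top of the line coding, which is why real drives land in the low-to-mid 14 GB/s range.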

I believe the DGX Spark external USB is 20GB/s, but throughput needs to be tested.

The external USB ports on the DGX Spark are 20Gbit/sec, not 20GByte/sec. Best-case scenario, you’ll get around ~2GB/sec for large sequential reads from an external device.
It’s been tested multiple times, and there are a few posts here in the forum about external USB drive speeds.
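A quick sketch of that conversion (assuming the ports are USB 3.2 Gen 2x2 with 128b/132b line coding):

```python
# 20 Gbit/s divided by 8 gives bytes, then 128b/132b line coding trims the
# payload rate. Real-world protocol overhead lowers it further, which is
# why measured large sequential reads land around ~2 GB/s.
raw_gbit = 20
payload_gb_s = raw_gbit / 8 * (128 / 132)
print(round(payload_gb_s, 2))  # -> 2.42, before protocol overhead
```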

Update 1: PCIe 5.0 x4 allows for a theoretical bandwidth of up to 16 GB/s, with practical drive performance pushing over 14 GB/s, so I think this approach might be worth a try.

It’s certainly worth a try, but chances are you won’t see anywhere near 14GB/sec from the internal NVMe drive.
Here are results of a quick test on my FE DGX Spark with the Samsung 4TB drive:

dgx-spark:~$ sudo hdparm -tT --direct /dev/nvme0n1

/dev/nvme0n1:
Timing O_DIRECT cached reads:   21198 MB in  2.00 seconds = 10613.40 MB/sec
Timing O_DIRECT disk reads: 19930 MB in  3.00 seconds = 6643.07 MB/sec

And here are the results for the same test from a HP ZGX Nano G1N with a 4TB Corsair MP700 Micro:

zgx-spark:~$  sudo hdparm -tT --direct /dev/nvme0n1

/dev/nvme0n1:
Timing O_DIRECT cached reads:   4840 MB in  2.00 seconds = 2420.44 MB/sec
Timing O_DIRECT disk reads: 10552 MB in  3.00 seconds = 3517.18 MB/sec
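As a crude, portable complement to hdparm, a self-contained script like this can give a feel for sequential read throughput. Note that it reads through the page cache (no O_DIRECT), so it can be optimistic versus the raw device; hdparm --direct or fio with a file much larger than RAM are better for real measurements:

```python
# Write a temporary file, then time a sequential read in large chunks.
# Sizes here are tiny so the example runs quickly; use many GB for a
# meaningful test, and expect page-cache effects without O_DIRECT.
import os
import tempfile
import time

SIZE_MB = 64                 # keep small for the example
CHUNK = 8 * 1024 * 1024      # 8 MiB reads, similar to streaming large weights

fd, path = tempfile.mkstemp()
os.close(fd)
buf = os.urandom(CHUNK)
with open(path, "wb") as f:
    for _ in range(SIZE_MB * 1024 * 1024 // CHUNK):
        f.write(buf)

start = time.perf_counter()
read = 0
with open(path, "rb") as f:
    while chunk := f.read(CHUNK):
        read += len(chunk)
elapsed = time.perf_counter() - start
os.remove(path)
print(f"{read / elapsed / 1e6:.0f} MB/s over {read // (1024 * 1024)} MiB")
```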

Just keep an eye on RAM utilization while loading the experts, so you don’t get into a situation where you’re using Linux swap; the heavy writing could accelerate wear of your NVMe drive.
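One way to watch swap from a script on Linux is to parse /proc/meminfo. The sketch below runs against a sample string so it is self-contained; on the Spark you’d read the real file instead:

```python
# Parse /proc/meminfo-style text and report swap usage. The sample string
# below is made-up data so the snippet runs anywhere; on a Linux box,
# replace it with open("/proc/meminfo").read().

def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields, values in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields[key.strip()] = int(rest.split()[0])
    return fields

sample = (
    "MemTotal: 125000000 kB\n"
    "MemAvailable: 8000000 kB\n"
    "SwapTotal: 2000000 kB\n"
    "SwapFree: 1500000 kB"
)
m = parse_meminfo(sample)
swap_used_kb = m["SwapTotal"] - m["SwapFree"]
print(f"swap used: {swap_used_kb // 1024} MiB")  # -> swap used: 488 MiB
```

Any nonzero, growing swap usage while loading experts would be the signal to stop and reduce the resident working set.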


Thanks for the feedback. I will do further investigation.

They are using that project to run a 400B LLM on an iPhone: Anemll (@anemll): "Running 400B model on iPhone! 0.6 t/s Credit @danveloper @alexintosh @danpacary @anemll" | XCancel

It will be interesting to see if it can be duplicated on the DGX Spark; converting the code base to CUDA will need some work.