Hey guys,
Over the past days I was eagerly waiting for an M2.7 REAP to come out with quants suitable for a single DGX Spark, but I couldn't find anything, and since I also had holidays from Monday to Wednesday, I created one myself. Unfortunately I only have a single Spark plus a 256GB Threadripper with 2x 3090s, so my setup is pretty shitty for calibrating with huge sample sizes and sample lengths.
Also, running evals takes forever: I ran GPQA Diamond for over 15 hours just to discover I had been too stingy with max tokens at 16k, and 30% of samples didn't finish reasoning within that budget. It still got 60%, so the general intelligence seems to be there. Will run some more evals on external compute over the weekend; will most likely need to rent some H200s.
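If anyone wants to sanity-check the truncation rate before burning 15 hours: a quick way is to count `finish_reason == "length"` in the saved responses (the OpenAI-style field that means the token budget ran out before a stop token). A minimal sketch, assuming your eval harness logs results as dicts with that field:

```python
def truncation_ratio(results):
    """Fraction of samples that exhausted the max-token budget.

    Assumes OpenAI-style responses where finish_reason == "length"
    means the generation was cut off (vs. "stop" for a clean finish).
    """
    if not results:
        return 0.0
    truncated = sum(1 for r in results if r.get("finish_reason") == "length")
    return truncated / len(results)

# Example: 3 of 10 samples ran out of budget
sample = [{"finish_reason": "length"}] * 3 + [{"finish_reason": "stop"}] * 7
print(truncation_ratio(sample))  # 0.3
```

Running this on a small pilot batch first would have told me to bump the budget before committing to the full run.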
Anyways, I thought I'd still share it here in case someone wants to give it a go as well :)
Speed is basically the same as the 122B A10B Qwen model with AutoRound.
There is definitely room for improvement with the new KV cache quants landing in vLLM. On a single Spark with fp8 KV cache you can get around 100k context, I guess. I haven't pushed it all too far since my agents are also sharing some portion of the RAM most of the time.
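For anyone wanting to sanity-check that ~100k figure on their own setup: KV cache size is just 2 (K and V) x layers x KV heads x head dim x context x bytes per element. A back-of-the-envelope sketch with placeholder dims (NOT the actual M2.7 config, plug in the real values from the model's config.json):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem):
    """KV cache size in GiB: 2 tensors (K and V) per layer,
    each of shape [num_kv_heads, context_len, head_dim]."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Hypothetical dims for illustration; fp8 KV = 1 byte/elem, fp16 = 2 bytes/elem
print(kv_cache_gib(62, 8, 128, 100_000, 1))  # fp8 at 100k context
print(kv_cache_gib(62, 8, 128, 100_000, 2))  # fp16 at 100k context, twice as large
```

Whatever is left of the Spark's unified memory after weights and my agents' RAM share is what bounds the usable context.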
| depth (tokens) | prefill tok/s | decode tok/s | TTFT (ms) |
|---|---|---|---|
| 0 | 2469.3 ± 13.3 | 29.28 ± 0.05 | 864.5 |
| 4096 | 2089.9 ± 12.5 | 27.73 ± 0.05 | 2784.8 |
| 8192 | 1890.3 ± 5.2 | 26.28 ± 0.05 | 5062.3 |
| 16384 | 1601.1 ± 6.5 | 23.88 ± 0.05 | 10647.7 |
Happy for any feedback; this is just a first draft. Will need to pick some better and bigger datasets for REAP and quantisation calibration. I just wanted to validate everything end to end first before renting more compute.