Best inference results on Spark: not llama.cpp, not vLLM, but SGLang

I wonder if anyone has made the same observations as I have. I have been trying to get the best performance out of gpt-oss-120B, and so far SGLang gives me very good performance with a fairly large context.

What are your observations? Any suggestions on the best setup? I am new to the Spark (three days in) and finalising a workstation setup with an RTX 5090 + Spark as a stack for various local AI tasks.

Quick test: 1000 tokens in 19.1 s ≈ 52 tok/s


📊 SGLang GPT-OSS-120B Benchmark Results

| Test   | Tokens | Time   | Throughput |
|--------|--------|--------|------------|
| Short  | 127    | 2.5 s  | 51 tok/s   |
| Medium | 500    | 9.6 s  | 52 tok/s   |
| Long   | 1000   | 19.1 s | 52 tok/s   |
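As a sanity check, the throughput column is just tokens divided by wall-clock time. A minimal sketch using the token counts and timings reported above:

```python
# Recompute throughput (tokens generated / wall-clock seconds) for each run.
runs = {
    "short": (127, 2.5),
    "medium": (500, 9.6),
    "long": (1000, 19.1),
}

for name, (tokens, seconds) in runs.items():
    tps = tokens / seconds
    print(f"{name}: {tps:.0f} tok/s")  # short: 51, medium/long: 52
```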

Comparison with llama.cpp (from my earlier runs)

| Engine    | Model              | Throughput |
|-----------|--------------------|------------|
| SGLang    | GPT-OSS-120B MXFP4 | 52 tok/s   |
| llama.cpp | GPT-OSS-120B MXFP4 | 46 tok/s   |
| llama.cpp | GPT-OSS-20B MXFP4  | 61 tok/s   |

SGLang is ~13% faster than llama.cpp for the 120B model on single-user workloads (52 vs 46 tok/s).

KV Cache / Context Window
Comparison: 128K vs 64K Context

| Setting     | 128K Context | 64K Context  | Δ (vs 128K) |
|-------------|--------------|--------------|-------------|
| GPU memory  | 102 GB       | 88 GB        | −14 GB      |
| Available   | 12.8 GB      | 28.7 GB      | +16 GB      |
| Max context | 131K tokens  | 65K tokens   | −66K        |
| KV capacity | 418K tokens  | ~200K tokens | −218K       |
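For reference, the 64K run above corresponds to capping the context window at launch time. A minimal sketch of such a launch (the model path and port are illustrative; `--context-length` is SGLang's flag for limiting the context window):

```shell
# Launch SGLang with a 64K context cap to free roughly 14 GB of GPU memory
# (model path is illustrative -- substitute your local checkpoint).
python -m sglang.launch_server \
  --model-path openai/gpt-oss-120b \
  --context-length 65536 \
  --port 30000
```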

Memory breakdown:

  • Model weights: 66.5 GB
  • KV cache pool: ~28 GB
  • CUDA graphs: 3.6 GB
  • Available: 12.8 GB
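The itemised components above roughly account for the 102 GB figure from the 128K run; a quick sketch of the arithmetic (the remainder would be runtime overhead not itemised above):

```python
# Sum the itemised memory components from the 128K-context run.
weights = 66.5      # GB, model weights
kv_pool = 28.0      # GB, KV cache pool (approximate)
cuda_graphs = 3.6   # GB, CUDA graph buffers

itemised = weights + kv_pool + cuda_graphs
print(f"itemised: {itemised:.1f} GB")           # 98.1 GB
print(f"unaccounted: {102 - itemised:.1f} GB")  # ~3.9 GB of runtime overhead
```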


Yes, but it’s limited to a single orphaned “spark” build of SGLang that lags behind the main branch.

If you want to run gpt-oss-120b, then yes, that's a solid choice, but for everything else it's llama.cpp (if you want the fastest token generation) or vLLM (if you run a cluster or want faster prompt processing).


Thanks for your feedback, I’ll do more research and tests.

I think vLLM's performance with gpt-oss-120b is a little disappointing, personally.

It feels like the Spark was made for gpt-oss-120b (or the other way around?), and vLLM should be up to the task.

The latest work in llama.cpp that enabled the big tok/s gain on gpt-oss is amazing.