I wonder if anyone has the same observations as I do. I have been trying to get the best performance out of gpt-oss-120B, and the outcome so far is that SGLang gives me very good performance with a pretty big context.
What are your observations, and any suggestions on the best setup? – I am new to the Spark (3 days), finalising my workstation setup with an RTX 5090 + Spark as a stack for various local AI tasks.
📊 SGLang GPT-OSS-120B Benchmark Results
| Test | Tokens | Time | Throughput |
|---|---|---|---|
| Short | 127 | 2.5 s | 51 tok/s |
| Medium | 500 | 9.6 s | 52 tok/s |
| Long | 1000 | 19.1 s | 52 tok/s |
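Throughput here is just generated tokens divided by wall-clock generation time; a minimal sketch recomputing the column from the raw numbers above:

```python
# Recompute throughput (tok/s) from token count and wall-clock time.
runs = {
    "Short": (127, 2.5),
    "Medium": (500, 9.6),
    "Long": (1000, 19.1),
}
for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens / seconds:.0f} tok/s")
# → Short: 51 tok/s, Medium: 52 tok/s, Long: 52 tok/s
```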
Comparison with llama.cpp (from my earlier runs)
| Engine | Model | Throughput |
|---|---|---|
| SGLang | GPT-OSS-120B MXFP4 | 52 tok/s |
| llama.cpp | GPT-OSS-120B MXFP4 | 46 tok/s |
| llama.cpp | GPT-OSS-20B MXFP4 | 61 tok/s |
SGLang is ~13% faster than llama.cpp for the 120B model on single-user workloads.
KV Cache / Context Window
Comparison: 128K vs 64K Context
| Setting | 128K Context | 64K Context | Difference |
|---|---|---|---|
| GPU memory used | 102 GB | 88 GB | -14 GB |
| Available | 12.8 GB | 28.7 GB | +15.9 GB |
| Max context | 131K tokens | 65K tokens | -66K |
| KV capacity | 418K tokens | ~200K tokens | -218K |
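For anyone wanting to try the smaller-context setup: this is roughly the launch line I use to cap the context at 64K. Treat it as a sketch – the model path is whatever your local checkout uses, and you should double-check the flags against your SGLang version:

```shell
# Launch sketch: cap the context window at 64K to free up ~16 GB of headroom.
# Adjust --model-path to where your MXFP4 weights live.
python -m sglang.launch_server \
  --model-path openai/gpt-oss-120b \
  --context-length 65536 \
  --port 30000
```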
Memory breakdown:
- Model weights: 66.5 GB
- KV cache pool: ~28 GB
- CUDA graphs: 3.6 GB
- Available: 12.8 GB