I had Codex benchmark llama-server with and without --no-mmap, and the results weren’t great:
I benchmarked startup readiness for a step-3.5-flash llama-server setup by timing from process launch to the first
successful response from the Chat Completions API.

Method:
- Launch llama-server with the same model/config in both cases.
- Send repeated “hello world” requests to /v1/chat/completions with max_tokens: 1.
- Record elapsed time until the first successful generated response.
- The only variable changed: --no-mmap enabled vs. disabled.
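The measurement loop above can be sketched roughly as follows. This is a minimal sketch, not the exact script I ran: the endpoint path and payload follow the OpenAI-compatible Chat Completions API that llama-server exposes, but the port, model name, and polling interval are assumptions; in practice the timer should start when you spawn the server process, not when this script starts.

```python
import json
import time
import urllib.error
import urllib.request

def build_payload(model: str) -> bytes:
    # Tiny request: one-word prompt, single generated token.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "hello world"}],
        "max_tokens": 1,
    }).encode()

def time_to_first_response(url: str, model: str, timeout: float = 300.0) -> float:
    """Poll until the first successful chat completion; return elapsed seconds."""
    start = time.monotonic()
    payload = build_payload(model)
    while time.monotonic() - start < timeout:
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass  # server not accepting requests yet; keep polling
        time.sleep(0.2)
    raise TimeoutError("server never became ready")

if __name__ == "__main__":
    # Launch llama-server (with or without --no-mmap) immediately before this,
    # then record launch-to-first-response.
    print(time_to_first_response("http://127.0.0.1:8080/v1/chat/completions",
                                 "step-3.5-flash"))
```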
Results (single run each):
- With --no-mmap: 16.100s
- Without --no-mmap: 108.038s
Difference:
- --no-mmap improved startup-to-first-response by 91.938s
- About 6.7x faster readiness
Note:
- After startup, per-request latency for tiny requests was similar in both cases. The major gain was initial server
readiness time.
I thought the new kernel was supposed to make mmap usable? This is the only system I’ve ever used where I can remember mmap being this slow… I don’t understand why it is so slow.
EDIT: more runs
Re-ran it 3x per variant (6 total), measuring launch-to-first-successful-chat-response for hello world, max_tokens: 1.
Per-run results:
- --no-mmap: 15.101s, 14.901s, 15.960s
- default (mmap): 95.430s, 96.556s, 102.464s
Summary:
- --no-mmap: min 14.901s, median 15.101s, mean 15.321s, max 15.960s
- default (mmap): min 95.430s, median 96.556s, mean 98.150s, max 102.464s
Delta (means):
- --no-mmap faster by 82.829s
- About 6.4x faster startup-to-first-response
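For reproducibility, here is a quick sketch recomputing the summary stats and deltas from the per-run times above (the timings are copied verbatim from the runs; only the stats code is new):

```python
import statistics

# Per-run launch-to-first-response times in seconds, from the re-runs above.
no_mmap = [15.101, 14.901, 15.960]
mmap_default = [95.430, 96.556, 102.464]

for name, runs in [("--no-mmap", no_mmap), ("default (mmap)", mmap_default)]:
    print(f"{name}: min {min(runs):.3f}s, median {statistics.median(runs):.3f}s, "
          f"mean {statistics.mean(runs):.3f}s, max {max(runs):.3f}s")

delta = statistics.mean(mmap_default) - statistics.mean(no_mmap)
ratio = statistics.mean(mmap_default) / statistics.mean(no_mmap)
print(f"--no-mmap faster by {delta:.3f}s (~{ratio:.1f}x)")
```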