The best way to test it is with llama.cpp, but the effect is even more dramatic with vllm.
First, install llama.cpp.
Install development tools:
sudo apt install clang cmake libcurl4-openssl-dev
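If you're on Fedora instead (as in the 6.17 test below), the equivalent should be something like this; package names are my best guess, not tested:
sudo dnf install clang cmake libcurl-devel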
Check out llama.cpp:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Build:
cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20
Drop the page cache before and between runs so every load starts cold (writing 3 to drop_caches frees the page cache plus reclaimable dentries and inodes): sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
Test and download the model first:
build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0
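The -hf flag downloads the GGUF into llama.cpp's cache directory (by default ~/.cache/llama.cpp on Linux, unless LLAMA_CACHE points elsewhere; that's my understanding of the default, so verify on your box). You can check how much data will have to come off disk on a cold load with:
ls -lh ~/.cache/llama.cpp/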
Drop cache and time the model loading with mmap first:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0
Once you see this, hit Ctrl-C to quit (I didn't find a way to load the model and immediately quit non-interactively):
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
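If you want to avoid the manual Ctrl-C, something along these lines should work (untested sketch; it relies on the /health endpoint that llama-server exposes, which returns an error until the model is loaded): start the server in the background, poll /health, then kill it once it reports ready.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
start=$(date +%s)
build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 &
pid=$!
until curl -sf http://127.0.0.1:8080/health > /dev/null; do sleep 1; done
echo "model load took $(( $(date +%s) - start ))s"
kill $pid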
It took 1 minute 48 seconds on my machine:
real 1m48.705s
If you run it again, it will be somewhat faster if the page cache has not been reclaimed yet. I measured 1 minute 8 seconds.
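To check whether the model is still sitting in the page cache between runs:
free -h
If the buff/cache column is still roughly the model size, the second load is served from RAM rather than from disk.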
Now, try without mmap:
Drop cache first:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
Run without mmap:
time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 --no-mmap
Same drill: quit once the server has started.
Note the faster loading time: real 1m8.346s
This is all on stock DGX OS:
eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
Now, the real kicker: kernel 6.17.
Running Fedora 43 with a kernel compiled from the NV-Kernels repository:
eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.17.1-nvidia-v6+ #8 SMP PREEMPT_DYNAMIC Tue Nov 4 00:31:25 PST 2025 aarch64 GNU/Linux
With mmap: real 1m50.893s - about the same as on the stock kernel.
Without mmap: real 0m22.509s
As you can see, mmap performance stays just as bad, but no-mmap loading improves dramatically on the 6.17 kernel. I suspect hugepage support has something to do with it: when I tried the 6.11-64k kernel on stock DGX OS, it had similar no-mmap performance, but unfortunately it couldn't load a model with mmap at all, so I had to roll back to the regular 6.11 kernel.
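If you want to poke at the hugepage angle yourself, these are the standard knobs to look at (nothing DGX-specific); run them during a no-mmap load and see whether AnonHugePages grows:
cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo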
But 6.17 with the NVIDIA patches is very stable, and much faster at loading models without mmap. Unfortunately, while llama.cpp handles no-mmap well, some other apps don't, notably vllm.
In vllm, specifying --safetensors-load-strategy eager improves Qwen/Qwen3-Next-80B-A3B-FP8 loading time from 8 minutes (!!!) to only 1 minute 30 seconds (!) on the 6.17 kernel, but takes an additional ~20 GB of RAM.
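For reference, the vllm invocation would look something like this (just the serve subcommand plus that flag; any other options omitted):
vllm serve Qwen/Qwen3-Next-80B-A3B-FP8 --safetensors-load-strategy eager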
Hope this helps. Let me know if you need any more info.