Very slow mmap on DGX Spark that affects model loading - questions to NVIDIA

So, I’ve been trying to get maximum performance out of my DGX Spark for the past week, and noticed that memory-mapping (mmap) performance is VERY slow on it.

Well, actually, with the stock kernel regular memory-copy performance is slow too, but that is fixed in the 6.17.x kernel, likely by the introduction of the NO_PAGE_MAPCOUNT option, which significantly reduces allocation overhead.
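To check whether that option is actually enabled in a given kernel (this assumes the config is shipped in /boot, as it is on DGX OS), something like this should work:

grep NO_PAGE_MAPCOUNT /boot/config-$(uname -r)

or, if the kernel exposes its config via procfs:

zcat /proc/config.gz | grep NO_PAGE_MAPCOUNT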

But unfortunately it doesn’t solve mmap issues.

Some numbers:

  • Loading gpt-oss-120b using llama.cpp on stock kernel (6.11)
    • with mmap: 1 minute 44 seconds
    • without mmap (--no-mmap): 56 seconds
  • Same, but on 6.17.1-nvidia kernel (built from NV-Kernels repo)
    • with mmap: 1 minute 30 seconds
    • without mmap: 22 seconds !!!

I also tried vLLM. Using --safetensors-load-strategy eager (which disables mmap) resulted in a significant speedup too (tested on the Qwen3-Next-80B-A3B-FP8 model). Numbers are for the 6.17.1 kernel only - I didn’t write down the benchmarks on the stock kernel, but it was also super slow (the commands I’m comparing are sketched after the list):

  • default settings (mmap): 8 minutes 41 seconds
  • load strategy eager (no mmap): 1 minute 28 seconds (!!!)
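For reference, the two invocations look roughly like this (a sketch - exact flags depend on your vLLM version and setup, and I’m omitting the usual serving options; vLLM prints the weight-loading time in its startup log):

vllm serve Qwen/Qwen3-Next-80B-A3B-FP8

vllm serve Qwen/Qwen3-Next-80B-A3B-FP8 --safetensors-load-strategy eager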

Unfortunately, when using eager mode, the vLLM process takes way too much “CPU” RAM, reducing the available “VRAM”.

Question to NVIDIA: are you aware of these issues? Any plans to fix them?

Hi, thank you for reaching out. We will investigate why using mmap seems to slow down performance.

Thanks for your reply! Looking forward to the updates!

Any ETA on providing an official 6.17.x kernel for DGX OS? It doesn’t solve the mmap issue, but it significantly improves loading performance with mmap off.

Looks like my initial assumption that NO_PAGE_MAPCOUNT was the main reason for this speedup was not correct. It does seem to improve mmap performance a little bit, but not by much.

Just tried vLLM loading of Qwen3-Next-80B-A3B-FP8 on the stock kernel (6.11):

  • default settings (using mmap): 9 minutes 19 seconds
  • load strategy eager (no mmap): 2 minutes 11 seconds

Hi Eugr,

The K6.17 kernel will be available for DGX OS in January 2026.

As a general guideline, if the model’s memory usage is less than the available free memory, it’s better to use no-mmap.
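For example, a quick way to decide is to compare the on-disk model size against free memory (the cache path below is just where llama.cpp’s -hf downloads usually end up; adjust it for your setup):

free -h
ls -lh ~/.cache/llama.cpp/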

Could you share the steps you followed to run the experiment with and without mmap? We’d like to investigate the bottleneck further.

thanks

Bibek

The best way to test it is with llama.cpp, but the effect is even more dramatic with vLLM.

First, build llama.cpp from source.

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20

Drop cache before and between runs: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Do a test run first (this will also download the model):

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0

Drop cache and time the model loading with mmap first:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0

Once you see this, hit Ctrl-C to quit (I didn’t find a way to load the model and quit immediately in a non-interactive way; a possible workaround is sketched a bit further down):

main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle

It took 1 minute 48 seconds on my machine:
real 1m48.705s

If you run it again, it will be somewhat faster if the memory hasn’t been reclaimed yet; I measured 1 minute 8 seconds.
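By the way, one possible workaround for the interactive quit, which I haven’t verified: time a single-token run with llama-cli instead - it loads the model, generates one token and exits on its own, so the wall time is still dominated by loading (add --no-mmap for the no-mmap case):

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time build/bin/llama-cli -hf ggml-org/gpt-oss-120b-GGUF -ngl 999 -no-cnv -p "Hi" -n 1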

Now, try without mmap:

Drop cache first:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Run without mmap:

time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 --no-mmap

Same drill: quit once the server has started.

Note the faster loading time: real 1m8.346s

This is all on stock DGX OS:

eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Now, the real kicker: kernel 6.17.

Running Fedora 43 with a kernel compiled from the NV-Kernels repository:

eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.17.1-nvidia-v6+ #8 SMP PREEMPT_DYNAMIC Tue Nov  4 00:31:25 PST 2025 aarch64 GNU/Linux

With mmap: real 1m50.893s - about the same as on stock kernel.
Without mmap: real 0m22.509s

As you can see, mmap performance stays bad, but no-mmap performance improves significantly with the 6.17 kernel. I suspect hugepage support has something to do with it: when I tried the 6.11-64k kernel on stock DGX OS, it had similar no-mmap performance, but unfortunately it couldn’t load a model with mmap at all, so I had to roll back to 6.11.
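In case it helps the investigation, the standard sysfs knobs for checking the current transparent hugepage policy (just for comparing the two kernels, not a fix) are:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
grep -i huge /proc/meminfo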

But 6.17 with the NVIDIA patches is very stable, and much faster at model loading without mmap. Unfortunately, while llama.cpp handles no-mmap well, some other apps don’t, notably vLLM.

In vLLM, specifying --safetensors-load-strategy eager improves Qwen/Qwen3-Next-80B-A3B-FP8 loading time from 8 minutes (!!!) to only 1 minute 30 seconds (!) on the 6.17 kernel, but takes an additional 20 GB of RAM.

Hope this helps. Let me know if you need any more info.