Very slow mmap on DGX Spark that affects model loading - questions to NVIDIA

So, I've been trying to get maximum performance out of my DGX Spark for the past week, and I noticed that memory mapping (mmap) performance is VERY slow on it.

Well, actually, with the stock kernel regular memory copy performance is slow too, but that is fixed in the 6.17.x kernel, likely by the introduction of the NO_PAGE_MAPCOUNT option, which significantly reduces allocation overhead.

But unfortunately it doesn’t solve mmap issues.
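For reference, a quick way to check whether a given kernel build actually has that option (a sketch; the config location depends on the distro, and some kernels expose /proc/config.gz instead):

grep NO_PAGE_MAPCOUNT /boot/config-$(uname -r)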

Some numbers:

  • Loading gpt-oss-120b using llama.cpp on stock kernel (6.11)
    • with mmap: 1 minute 44 seconds
    • without mmap (--no-mmap): 56 seconds
  • Same, but on 6.17.1-nvidia kernel (built from NV-Kernels repo)
    • with mmap: 1 minute 30 seconds
    • without mmap: 22 seconds !!!

I also tried vLLM. Using --safetensors-load-strategy eager (which disables mmap) resulted in a significant speedup too (tested with the Qwen3-Next-80B-A3B-FP8 model). The numbers below are for the 6.17.1 kernel only - I haven't written down the benchmarks for the stock kernel, but it was also very slow:

  • default settings (mmap): 8 minutes 41 seconds
  • load strategy eager (no mmap): 1 minute 28 seconds (!!!)

Unfortunately, when using eager mode, the vLLM process takes far more "CPU" RAM, reducing the available "VRAM".

Question to NVIDIA: are you aware of these issues? Any plans to fix them?

2 Likes

Hi, thank you for reaching out. We will investigate why using mmap seems to slow down performance.

1 Like

Thanks for your reply! Looking forward to the updates!

Any ETA on providing an official 6.17.x kernel for DGX OS? It doesn't solve the mmap issue, but it significantly improves loading performance with mmap off.

Looks like my initial assumption that NO_PAGE_MAPCOUNT was the main reason for this speedup was not correct. It does seem to improve mmap performance a little, but not by much.

Just tried vLLM loading of Qwen3-Next-80B-A3B-FP8 on the stock kernel (6.11):

  • default settings (using mmap): 9 minutes 19 seconds
  • load strategy eager (no mmap): 2 minutes 11 seconds
2 Likes

Hi Eugr,

The 6.17 kernel will be available for DGX OS in January 2026.

As a general guideline, if the model's memory usage is less than the available free memory, it's better to use --no-mmap.
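A rough way to apply that guideline is to compare the size of the weights with what the kernel reports as available (illustrative commands; the model path is just a placeholder):

ls -lh /path/to/model.gguf
grep MemAvailable /proc/meminfo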

Could you share the steps you followed to run the experiment with and without mmap? We’d like to investigate the bottleneck further.

thanks

Bibek

2 Likes

The best way to test it is with llama.cpp, but the effect is even more dramatic with vLLM.

First, install llama.cpp.

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20

Drop cache before and between runs: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Do a test run first to download the model:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0

Drop cache and time the model loading with mmap first:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0

Once you see the following output, hit Ctrl-C to quit (I didn't find a built-in way to load the model and immediately quit non-interactively, but see the rough scripted workaround a bit further down):

main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle

It took 1 minute 48 seconds on my machine:
real 1m48.705s

If you run it again, it will be somewhat faster if the memory hasn't been reclaimed yet. I measured 1 minute 8 seconds.
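By the way, to avoid the interactive Ctrl-C step, something like the sketch below should work (it assumes llama-server's /health endpoint returns an error until the model has finished loading, which seems to be the case on recent builds):

#!/usr/bin/env bash
# Sketch: time model loading non-interactively.
# Assumption: /health answers with an error (503) while loading and 200 once ready.
set -e
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
start=$(date +%s)
build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 &
pid=$!
until curl -sf http://127.0.0.1:8080/health > /dev/null; do sleep 1; done
echo "model load took $(( $(date +%s) - start ))s"
kill "$pid"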

Now, try without mmap:

Drop cache first:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Run without mmap:

time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 --no-mmap

Same drill: quit once the server has started.

Note the faster loading time: real 1m8.346s

This is all on stock DGX OS:

eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Now, the real kicker: kernel 6.17.

Running Fedora 43 with a kernel compiled from NV-Kernels repository:

eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.17.1-nvidia-v6+ #8 SMP PREEMPT_DYNAMIC Tue Nov  4 00:31:25 PST 2025 aarch64 GNU/Linux

With mmap: real 1m50.893s - about the same as on stock kernel.
Without mmap: real 0m22.509s

As you can see, mmap performance stays bad, but no-mmap improves significantly with the 6.17 kernel. I suspect hugepage support has something to do with it: when I tried the 6.11-64k kernel on stock DGX OS, it had similar no-mmap performance, but unfortunately it couldn't load a model with mmap at all, so I had to roll back to 6.11.
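If anyone wants to poke at the hugepage angle, these are the knobs I'd look at first (just my guess that THP is involved; both are standard sysfs/proc locations):

cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo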

But 6.17 with NVIDIA patches is very stable, and much faster at model loading without mmap. Unfortunately, while llama.cpp handles no-mmap well, some other apps don't, notably vLLM.

In vLLM, specifying --safetensors-load-strategy eager improves Qwen/Qwen3-Next-80B-A3B-FP8 loading time from 8 minutes (!!!) to only 1 minute 30 seconds (!) on the 6.17 kernel, but it takes an additional 20 GB of RAM.
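For reference, my vLLM invocation looks roughly like this (the model name is just the one I tested; the only relevant part is the load-strategy flag mentioned above):

vllm serve Qwen/Qwen3-Next-80B-A3B-FP8 --safetensors-load-strategy eager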

Hope this helps. Let me know if you need any more info.

2 Likes

Hi @eugr, --no-mmap is better for fast loading of big models. Could you please share the reference point you are comparing against, and the timings?

1 Like

See the entire post above yours. I measured loading times on different kernels, with and without mmap, for both llama.cpp and vLLM, and included the steps to reproduce.

Looks like the new 6.14-hwe kernel improves --no-mmap performance for llama.cpp from 1 minute 8 seconds to 27 seconds for the same gpt-oss-120b model. Still not as fast as 6.17 kernel (22 seconds), but very close now.

Unfortunately, mmap performance remains bad - still 1 minute 43 seconds for this model.

With mmap, loading huge models can be slower due to the large number of page faults and related overhead, as ~60 GB is brought into the page cache lazily, page by page.

Could you run the command below and check?

$ sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb"

Increasing 'read_ahead_kb' for the NVMe device helps the kernel prefetch bigger chunks and reduces page faults.
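If that helps, one way to make the setting persistent across reboots is a udev rule along these lines (a sketch; the device match may need adjusting for your NVMe namespace):

echo 'ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/read_ahead_kb}="8192"' | sudo tee /etc/udev/rules.d/99-nvme-readahead.rules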

I tried it and observed that on kernel v6.17, mmap loading time was reduced by ~50% and no-mmap time by ~35%.

On kernel v6.14, mmap time didn't improve significantly, while no-mmap time was reduced by ~50%.

This may be due to improvements in the read-ahead related kernel code between v6.14 and v6.17.

I'll try that, but I suspect there is something else at play here too. I'm also going to test with the mainline 6.17 kernel to see if there's any difference.

On my AMD Strix Halo (gfx1151, AMD AI Max+ 395) system, which is a fairly direct competitor to the Spark with similar memory bandwidth and a unified memory setup (although it uses Infinity Fabric to connect the GPU to the memory bus), I'm able to load the same model in 28 seconds from cold (and 20 seconds with --no-mmap).

Running Fedora 43 with its regular kernel:

Linux ai 6.17.7-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Nov  2 15:30:09 UTC 2025 x86_64 GNU/Linux

Read-ahead settings:

$ cat /sys/block/nvme0n1/queue/read_ahead_kb
128

I’ll get back to you with Spark results using your suggestions and also using Fedora kernel later today.

So, there was an interesting post in another thread about GPUDirect RDMA, and I wonder if it's related to the mmap performance issue.

The post says:

For performance reasons, specifically for CUDA contexts associated to the iGPU, the system memory returned by the pinned device memory allocators (e.g. cudaMalloc) cannot be coherently accessed by the CPU complex nor by I/O peripherals like PCI Express devices.

Hence the GPUDirect RDMA technology is not supported, and the mechanisms for direct I/O based on that technology, for example nvidia-peermem (for DOCA-Host), dma-buf or GDRCopy, do not work.

So I'm wondering: could that be the design flaw that affects mmap performance? In other words, does the NVMe -> DMA -> unified RAM -> GPU path break and turn into something like NVMe -> DMA -> RAM -> memcpy -> (V)RAM -> GPU?

1 Like

Tried on the 6.17 kernel (built by NVIDIA, from the proposed packages) - I'm seeing the same improvement:

  • mmap: 2 minutes → 28 seconds
  • no-mmap: 25 seconds → 15 seconds

Unfortunately, no improvement for vLLM - it still takes >8 minutes to load Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (it takes 1 minute 30 seconds on my Strix Halo box).

Just in: GitHub - napmany/llmsnap: Lightning-fast LLM swapping with sleep/wake support, compatible with vllm, llama.cpp, etc

Not tested yet - it uses vLLM sleep mode.

Source: Reddit.

1 Like

A few notes:

  • This is a rebranded fork of llama-swap without any attribution.
  • The only feature added is sleep mode support.
  • vLLM sleep mode is useless on the Spark, because it just offloads the model weights from GPU VRAM to CPU RAM, which with unified memory just adds extra work with no benefit.
4 Likes

Dammit. I didn't realize that vLLM sleep mode just offloads to CPU RAM. What a pity.

The author wrote that he wanted to merge it into llama-swap, but they didn't want it, because most llama-swap users use llama.cpp rather than vLLM.

1 Like

Yeah, I’ve just read the Reddit post about it.

Asking around in the vLLM community Slack, I discovered that using fastsafetensors drastically improves my model loading time on vLLM and decreases memory usage when loading models, allowing me to load models with 70GB or larger weights (such as Llama 4 Scout NVFP4) without running out of unified memory.

I build vLLM from source, but however you're installing/running it, you need to get the fastsafetensors dependency into its Python environment. For me, that's:

uv pip install "fastsafetensors>=0.1.10"

Then, when starting vLLM, pass the additional option --load-format fastsafetensors. This took the time to load and get the referenced Llama 4 Scout model ready to serve on the DGX Spark down from 7.5 minutes to 1 minute, and decreased the total unified memory required to load the weights by at least 40 GB.
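Putting it together, the serve command ends up looking roughly like this (the model name is just an example; adjust to whatever you run):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --load-format fastsafetensors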

4 Likes

Thanks, I’ll try that!

Wow, this changes things! Thanks so much!
Qwen3-Next FP8 improved from almost 9 minutes to 24 seconds - crazy!!!

1 Like

Yeah, it really does. I've added that to my base Docker container. Thanks for sharing it, @ben3241!

1 Like