Very slow mmap on DGX Spark that affects model loading - questions to NVIDIA

So, I've been trying to get maximum performance out of my DGX Spark for the past week, and I noticed that memory mapping (mmap) performance is VERY slow on it.

Well, actually, with the stock kernel regular memory copy performance is slow too, but that is fixed in the 6.17.x kernel, likely by the introduction of the NO_PAGE_MAPCOUNT option, which significantly reduces allocation overhead.

But unfortunately it doesn’t solve mmap issues.
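For reference, a quick way to check whether a given kernel build actually has that option (a sketch; the config location depends on the distro, and some kernels expose /proc/config.gz instead):

grep NO_PAGE_MAPCOUNT /boot/config-$(uname -r)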

Some numbers:

  • Loading gpt-oss-120b using llama.cpp on stock kernel (6.11)
    • with mmap: 1 minute 44 seconds
    • without mmap (--no-mmap): 56 seconds
  • Same, but on 6.17.1-nvidia kernel (built from NV-Kernels repo)
    • with mmap: 1 minute 30 seconds
    • without mmap: 22 seconds !!!

I also tried vLLM. Using --safetensors-load-strategy eager (which disables mmap) resulted in a significant speedup too (tested with the Qwen3-Next-80B-A3B-FP8 model). The numbers below are for the 6.17.1 kernel only - I haven't written down the benchmarks for the stock kernel, but it was also very slow:

  • default settings (mmap): 8 minutes 41 seconds
  • load strategy eager (no mmap): 1 minute 28 seconds (!!!)

Unfortunately, when using eager mode, the vLLM process takes far more "CPU" RAM, reducing the available "VRAM".

Question to NVIDIA: are you aware of these issues? Any plans to fix them?

2 Likes

Hi, thank you for reaching out. We will investigate why using mmap seems to slow down performance.

1 Like

Thanks for your reply! Looking forward to the updates!

Any ETA on providing an official 6.17.x kernel for DGX OS? It doesn't solve the mmap issue, but it significantly improves loading performance with mmap off.

Looks like my initial assumption that NO_PAGE_MAPCOUNT was the main reason for this speedup was not correct. It does seem to improve mmap performance a little, but not by much.

Just tried vLLM loading of Qwen3-Next-80B-A3B-FP8 on the stock kernel (6.11):

  • default settings (using mmap): 9 minutes 19 seconds
  • load strategy eager (no mmap): 2 minutes 11 seconds
2 Likes

Hi Eugr,

The 6.17 kernel will be available for DGX OS in January 2026.

As a general guideline, if the model's memory usage is less than the available free memory, it's better to use --no-mmap.
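A rough way to apply that guideline is to compare the size of the weights with what the kernel reports as available (illustrative commands; the model path is just a placeholder):

ls -lh /path/to/model.gguf
grep MemAvailable /proc/meminfo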

Could you share the steps you followed to run the experiment with and without mmap? We’d like to investigate the bottleneck further.

thanks

Bibek

2 Likes

The best way to test it is with llama.cpp, but the effect is even more dramatic with vLLM.

First, install llama.cpp.

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20

Drop cache before and between runs: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Do a test run first to download the model:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0

Drop cache and time the model loading with mmap first:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0

Once you see the following output, hit Ctrl-C to quit (I didn't find a built-in way to load the model and immediately quit non-interactively, but see the rough scripted workaround a bit further down):

main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle

It took 1 minute 48 seconds on my machine:
real 1m48.705s

If you run it again, it will be somewhat faster if the memory hasn't been reclaimed yet. I measured 1 minute 8 seconds.
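By the way, to avoid the interactive Ctrl-C step, something like the sketch below should work (it assumes llama-server's /health endpoint returns an error until the model has finished loading, which seems to be the case on recent builds):

#!/usr/bin/env bash
# Sketch: time model loading non-interactively.
# Assumption: /health answers with an error (503) while loading and 200 once ready.
set -e
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
start=$(date +%s)
build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 &
pid=$!
until curl -sf http://127.0.0.1:8080/health > /dev/null; do sleep 1; done
echo "model load took $(( $(date +%s) - start ))s"
kill "$pid"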

Now, try without mmap:

Drop cache first:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Run without mmap:

time build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 --no-mmap

Same drill: quit once the server has started.

Note the faster loading time: real 1m8.346s

This is all on stock DGX OS:

eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Now, the real kicker: kernel 6.17.

Running Fedora 43 with a kernel compiled from NV-Kernels repository:

eugr@spark:~/llm/llama.cpp$ uname -a
Linux spark 6.17.1-nvidia-v6+ #8 SMP PREEMPT_DYNAMIC Tue Nov  4 00:31:25 PST 2025 aarch64 GNU/Linux

With mmap: real 1m50.893s - about the same as on stock kernel.
Without mmap: real 0m22.509s

As you can see, mmap performance stays bad, but no-mmap improves significantly with the 6.17 kernel. I suspect hugepage support has something to do with it: when I tried the 6.11-64k kernel on stock DGX OS, it had similar no-mmap performance, but unfortunately it couldn't load a model with mmap at all, so I had to roll back to 6.11.
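If anyone wants to poke at the hugepage angle, these are the knobs I'd look at first (just my guess that THP is involved; both are standard sysfs/proc locations):

cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo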

But 6.17 with NVIDIA patches is very stable, and much faster at model loading without mmap. Unfortunately, while llama.cpp handles no-mmap well, some other apps don't, notably vLLM.

In vLLM, specifying --safetensors-load-strategy eager improves Qwen/Qwen3-Next-80B-A3B-FP8 loading time from 8 minutes (!!!) to only 1 minute 30 seconds (!) on the 6.17 kernel, but it takes an additional 20 GB of RAM.
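For reference, my vLLM invocation looks roughly like this (the model name is just the one I tested; the only relevant part is the load-strategy flag mentioned above):

vllm serve Qwen/Qwen3-Next-80B-A3B-FP8 --safetensors-load-strategy eager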

Hope this helps. Let me know if you need any more info.

2 Likes

Hi @eugr, --no-mmap is better for fast loading of big models. Could you please share the reference point you are comparing against, and the timings?

1 Like

See the entire post above yours. I measured loading times on different kernels, with and without mmap, for both llama.cpp and vLLM, and included the steps to reproduce.

Looks like the new 6.14-hwe kernel improves --no-mmap performance for llama.cpp from 1 minute 8 seconds to 27 seconds for the same gpt-oss-120b model. Still not as fast as 6.17 kernel (22 seconds), but very close now.

Unfortunately, mmap performance remains bad - still 1 minute 43 seconds for this model.

With mmap, loading huge models can be slower due to the large number of page faults and related overhead, as ~60 GB is brought into the page cache lazily, page by page.

Could you run the command below and check?

$ sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb"

Increasing 'read_ahead_kb' for the NVMe device helps the kernel prefetch bigger chunks and reduces page faults.
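If that helps, one way to make the setting persistent across reboots is a udev rule along these lines (a sketch; the device match may need adjusting for your NVMe namespace):

echo 'ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/read_ahead_kb}="8192"' | sudo tee /etc/udev/rules.d/99-nvme-readahead.rules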

I tried it and observed that on kernel v6.17, mmap loading time was reduced by ~50% and no-mmap time by ~35%.

On kernel v6.14, mmap time didn't improve significantly, while no-mmap time was reduced by ~50%.

This may be due to improvements in the read-ahead related kernel code between v6.14 and v6.17.

I'll try that, but I suspect there is something else at play here too. I'm also going to test with the mainline 6.17 kernel to see if there's any difference.

On my AMD Strix Halo (gfx1151, AMD AI Max+ 395) system, which is a fairly direct competitor to the Spark with similar memory bandwidth and a unified memory setup (although it uses Infinity Fabric to connect the GPU to the memory bus), I'm able to load the same model in 28 seconds from cold (and 20 seconds with --no-mmap).

Running Fedora 43 with its regular kernel:

Linux ai 6.17.7-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Nov  2 15:30:09 UTC 2025 x86_64 GNU/Linux

Read-ahead settings:

$ cat /sys/block/nvme0n1/queue/read_ahead_kb
128

I’ll get back to you with Spark results using your suggestions and also using Fedora kernel later today.

So, there was an interesting post in another thread about GPUDirect RDMA, and I wonder if it's related to the mmap performance issue.

The post says:

For performance reasons, specifically for CUDA contexts associated to the iGPU, the system memory returned by the pinned device memory allocators (e.g. cudaMalloc) cannot be coherently accessed by the CPU complex nor by I/O peripherals like PCI Express devices.

Hence the GPUDirect RDMA technology is not supported, and the mechanisms for direct I/O based on that technology, for example nvidia-peermem (for DOCA-Host), dma-buf or GDRCopy, do not work.

So I'm wondering: could that be the design flaw that affects mmap performance? In other words, does the NVMe -> DMA -> unified RAM -> GPU path break and turn into something like NVMe -> DMA -> RAM -> memcpy -> (V)RAM -> GPU?

1 Like

Tried on the 6.17 kernel (built by NVIDIA, from the proposed packages) - I'm seeing the same improvement:

  • mmap: 2 minutes → 28 seconds
  • no-mmap: 25 seconds → 15 seconds

Unfortunately, no improvement for vLLM - it still takes >8 minutes to load Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (it takes 1 minute 30 seconds on my Strix Halo box).

Just in: GitHub - napmany/llmsnap: Lightning-fast LLM swapping with sleep/wake support, compatible with vllm, llama.cpp, etc

Not tested yet - it uses vLLM sleep mode.

Source: Reddit.

1 Like

A few notes:

  • This is a rebranded fork of llama-swap without any attribution.
  • The only feature added is sleep mode support.
  • vLLM sleep mode is useless on the Spark, because it just offloads the model weights from GPU VRAM to CPU RAM, which with unified memory just adds extra work with no benefit.
4 Likes

Dammit. I didn't realize that vLLM sleep mode just offloads to CPU RAM. What a pity.

The author wrote that he wanted to merge it into llama-swap, but they didn't want it, because most llama-swap users use llama.cpp rather than vLLM.

1 Like

Yeah, I’ve just read the Reddit post about it.

Asking around in the vLLM community Slack, I discovered that using fastsafetensors drastically improves my model loading time on vLLM and decreases memory usage when loading models, allowing me to load models with 70GB or larger weights (such as Llama 4 Scout NVFP4) without running out of unified memory.

I build vLLM from source, but however you're installing/running it, you need to get the fastsafetensors dependency into its Python environment. For me, that's:

uv pip install "fastsafetensors>=0.1.10"

Then, when starting vLLM, pass the additional option --load-format fastsafetensors. This took the time to load and get the referenced Llama 4 Scout model ready to serve on the DGX Spark down from 7.5 minutes to 1 minute, and decreased the total unified memory required to load the weights by at least 40 GB.
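Putting it together, the serve command ends up looking roughly like this (the model name is just an example; adjust to whatever you run):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --load-format fastsafetensors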

4 Likes

Thanks, I’ll try that!

Wow, this changes things! Thanks so much!
Qwen3-Next FP8 improved from almost 9 minutes to 24 seconds - crazy!!!

1 Like

Yeah, it really does. I've added that to my base Docker container. Thanks for sharing it, @ben3241!

1 Like