I reused config from DGX OS, but set a few parameters:
Scheduler controlled preemption model (PREEMPT_LAZY) (NEW)
Module versioning implementation: genksyms
No per-page mapcount (EXPERIMENTAL) (NO_PAGE_MAPCOUNT) [N/y/?] (NEW) - SET TO YES!!!
This looks like the most important option, as it improves large memory copy operations.
Enable r8127 as a module (it's disabled by default)
CPU Power Management/CPU Frequency scaling/Default CPUFreq governor: schedutil (CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y) so the CPU frequency scales down when idle
Left the rest as default. I think I turned on a few options related to the Mellanox stuff and some ARM-related ones, but I don't remember which.
If you're not using Ubuntu, unset CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS (or set them to new values).
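If you're editing the .config by hand, the changes above can be scripted. A minimal sketch using sed on a small sample of the relevant lines (in a real kernel tree, the scripts/config helper does the same job; CONFIG_R8127 as the driver's symbol name is my assumption, and the key paths shown are the stock Ubuntu values):

```shell
# Demo input: a few lines as they'd appear in the DGX OS .config
# (in practice you'd run this against the full file)
cat > .config <<'EOF'
# CONFIG_NO_PAGE_MAPCOUNT is not set
# CONFIG_R8127 is not set
CONFIG_SYSTEM_TRUSTED_KEYS="debian/canonical-certs.pem"
CONFIG_SYSTEM_REVOCATION_KEYS="debian/canonical-revoked-certs.pem"
EOF

# Enable NO_PAGE_MAPCOUNT, build r8127 as a module, clear the Ubuntu key paths
sed -i 's|^# CONFIG_NO_PAGE_MAPCOUNT is not set$|CONFIG_NO_PAGE_MAPCOUNT=y|' .config
sed -i 's|^# CONFIG_R8127 is not set$|CONFIG_R8127=m|' .config
sed -i 's|^CONFIG_SYSTEM_TRUSTED_KEYS=.*|CONFIG_SYSTEM_TRUSTED_KEYS=""|' .config
sed -i 's|^CONFIG_SYSTEM_REVOCATION_KEYS=.*|CONFIG_SYSTEM_REVOCATION_KEYS=""|' .config

cat .config
```

In the kernel tree, `scripts/config --enable NO_PAGE_MAPCOUNT --module R8127 --set-str SYSTEM_TRUSTED_KEYS ""` is the equivalent; run `make olddefconfig` afterwards to resolve dependencies.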
Either disable r8169 in the config, or blacklist it in /etc/modprobe.d/blacklist.conf:
blacklist r8169
Otherwise it will load by default.
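The blacklist step can be scripted like this. A sketch written to a temp dir so it can be dry-run as a regular user; on the real system the target is a file under /etc/modprobe.d (with sudo), and if r8169 is baked into your initramfs you'll also want to regenerate it:

```shell
# Stand-in for /etc/modprobe.d so this can be dry-run without root
DEST=$(mktemp -d)

# Prevent the in-tree r8169 driver from binding the NIC at boot
printf 'blacklist r8169\n' > "$DEST/blacklist-r8169.conf"

cat "$DEST/blacklist-r8169.conf"
# On the real system, follow up with (assuming Ubuntu/Debian tooling):
#   sudo update-initramfs -u
```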
Compile and install the kernel, then sudo poweroff and power back on; a full power cycle avoids the Ethernet adapter malfunction.
Check that r8127 is loaded via lspci -k | grep r8127
Install NVIDIA drivers.
Reboot. Note that Ethernet is not glitching anymore!
Make sure nvidia-persistenced is enabled, so power management works properly.
So, with all that, I'm getting the same benchmarks in llama.cpp that I was getting in DGX OS, but model loading is now 4-5x faster: 20 seconds vs. more than a minute for gpt-oss-120b!
Thanks, this is working well on NixOS. The network interface stays up after a reboot. I found I didn't need to change most of the options you mentioned, since the NVIDIA-provided defconfig seemed reasonable. The only ones I overrode were:
CONFIG_FAULT_INJECTION = lib.mkForce no; # fault injection might add overhead; don't think I need it
CONFIG_SECURITY_APPARMOR_RESTRICT_USERNS = yes; # NixOS enables AppArmor by default
CONFIG_UBUNTU_HOST = no; # not Ubuntu!
So, 6.17.1 solved the model-loading issues without mmap, but mmap performance is still mediocre :( That's not a problem for llama.cpp, but it absolutely is for PyTorch-based workflows like vLLM.
You can speed it up with --safetensors-load-strategy eager, but then the vLLM process takes a lot of additional CPU memory (about 1/4 of the overall model size). For Qwen3-Next-80B in FP8 quant, that's an additional 20 GB of RAM!
I created a separate post about it; I want to know whether NVIDIA is aware of this and whether there are any plans to fix it.
I'm not familiar with the NixOS kernel build process, but how does r8127 get included in your kernel if the config option is not set in the arm64 defconfig? Am I missing something?
So, Nix can use the Debian annotations to build the kernel?
It definitely doesn't work with the standard Makefile. I based my .config on the one from the DGX OS install that came with the Spark, so it works; but when I tried to generate one using defconfig, the result didn't even boot.