I reused the config from DGX OS, but changed a few parameters:
Scheduler controlled preemption model (PREEMPT_LAZY) (NEW)
Module versioning implementation: genksyms
No per-page mapcount (EXPERIMENTAL) (NO_PAGE_MAPCOUNT) [N/y/?] (NEW) - SET TO YES!!!
This looks like the most important option, as it improves large memory copy operations.
Enable r8127 as a module (it is disabled by default).
CPU Power Management / CPU Frequency scaling / Default CPUFreq governor: schedutil (CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y), so the CPU frequency scales down when idle.
Left the rest as default. I think I turned on a few options related to Mellanox and some ARM-related ones, but I don't remember which.
If you are not building on Ubuntu, you need to unset CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS (or set them to new values); see the sketch below.
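For reference, a minimal sketch of setting all of the above with the kernel's scripts/config helper; the option names are the ones mentioned in this list, so verify them against your source tree before building:

```sh
# Run from the kernel source tree, after copying in the base .config.
./scripts/config --enable  PREEMPT_LAZY
./scripts/config --enable  NO_PAGE_MAPCOUNT
./scripts/config --module  R8127
./scripts/config --enable  CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
./scripts/config --disable R8169
# Required when not building on Ubuntu:
./scripts/config --set-str SYSTEM_TRUSTED_KEYS ""
./scripts/config --set-str SYSTEM_REVOCATION_KEYS ""
make olddefconfig   # resolve dependencies after the edits
```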
Either disable r8169 in the config, or blacklist it in /etc/modprobe.d/blacklist.conf:

```
blacklist r8169
```

Otherwise it will load by default.
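If r8169 is included in your initramfs, you also need to rebuild the initramfs so the blacklist applies during early boot; this step is an assumption on my part and the rebuild command is distro-specific:

```sh
echo "blacklist r8169" | sudo tee -a /etc/modprobe.d/blacklist.conf
# Rebuild the initramfs so the blacklist takes effect at boot:
sudo update-initramfs -u   # Debian/Ubuntu
# sudo dracut --force      # Fedora
```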
Compile and install the kernel, then sudo poweroff and power back on; a full power cycle (not just a reboot) avoids the Ethernet adapter malfunction.
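The build and install steps are the standard in-tree flow; a minimal sketch, assuming an in-tree (non-packaged) build:

```sh
make -j"$(nproc)"            # build kernel and modules
sudo make modules_install    # install modules to /lib/modules/<version>
sudo make install            # install the kernel image and update the bootloader
sudo poweroff                # full power cycle, not just a reboot
```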
Check that r8127 is loaded via `lspci -k | grep r8127`.
Install NVIDIA drivers.
Reboot. Note that Ethernet is not glitching anymore!
Make sure nvidia-persistenced is enabled, so power management works properly.
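On a systemd-based distro that is typically:

```sh
sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced   # should report active (running)
```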
So, with all that, I'm getting the same benchmarks in llama.cpp that I was getting on DGX OS, but model loading is now 4-5x faster: 20 seconds vs. more than a minute for gpt-oss-120b!
Thanks, this is working well on NixOS. The network interface stays up after a reboot. I found I didn't need to change most of the options you mentioned, since the NVIDIA-provided defconfig seemed reasonable. The only ones I overrode were:
```nix
CONFIG_FAULT_INJECTION = lib.mkForce no;          # fault injection might add overhead; don't think I need it?
CONFIG_SECURITY_APPARMOR_RESTRICT_USERNS = yes;   # NixOS enables AppArmor by default
CONFIG_UBUNTU_HOST = no;                          # Not Ubuntu!
```
So, 6.17.1 solved the model loading issues without mmap, but mmap performance is still mediocre :( Not a problem with llama.cpp, but absolutely a problem with PyTorch-based workflows, like vllm.
Now you can speed it up by using --safetensors-load-strategy eager, but then the vllm process takes a lot of additional CPU memory (about 1/4 of the overall model size). For Qwen3-Next-80B in FP8 quant, that's an additional 20GB of RAM!
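For reference, this is roughly how the flag is passed; the model ID below is only an illustrative assumption, and the flag spelling should be checked against `vllm serve --help` for your version:

```sh
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --safetensors-load-strategy eager   # faster load, at the cost of extra CPU RAM
```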
I created a separate post about it; I want to know whether NVIDIA is aware and whether there are any plans to fix this.
I'm not familiar with the NixOS kernel build process, but how does r8127 get included in your kernel if the config option is not set in the arm64 defconfig? Am I missing something?
So, Nix can use Debian annotations to build the kernel?
It definitely doesn't work when using the standard Makefile. I based my .config on the one from the DGX OS that came with the Spark, so it works; but when I tried to generate one using defconfig, it didn't even boot.
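For anyone reproducing this, a sketch of the approach that worked, starting from the shipped DGX OS config instead of defconfig (the /boot path is an assumption; /proc/config.gz works too if IKCONFIG_PROC is enabled):

```sh
# Start from the running DGX OS kernel config rather than the arm64 defconfig
cp /boot/config-"$(uname -r)" .config
make olddefconfig    # fill in defaults for any options new to this kernel
make menuconfig      # then adjust the options listed earlier in the thread
```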
Sorry for the late response. In this case I'm telling Nix to use defconfig, which ends up doing a make defconfig in the kernel source, which pulls in `arch/$(ARCH)/configs/defconfig`. You're right that it doesn't have CONFIG_R8127=m, so I assume it must be implied by another option.
BTW, I generated a .config using Debian's annotations script, which picked up the NVIDIA-specific annotations. It differs from mine in some options, but works OK. I just needed to replace AppArmor with Fedora's default (SELinux). I also changed the default CPU governor to schedutil, so it ramps down frequencies at idle.
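Roughly what I mean, as a sketch; the script path and the flavour name are assumptions about the Ubuntu/NVIDIA kernel tree, so check what your source actually ships:

```sh
# From the Ubuntu/NVIDIA kernel source tree: export a .config from the
# annotations database (path and flavour are assumptions for this tree)
./debian/scripts/misc/annotations --arch arm64 --flavour nvidia --export > .config
make olddefconfig
```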
Turns out the NO_PAGE_MAPCOUNT option doesn't really affect performance. What does is hugepage support, but that's on in the NVIDIA annotations too.
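You can check that transparent hugepages are enabled on the running kernel via sysfs:

```sh
cat /sys/kernel/mm/transparent_hugepage/enabled
# e.g. [always] madvise never  -- the bracketed value is the active mode
```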
Out of curiosity, I tried the 6.14 kernel from the DGX OS packages. The regular version didn't make any difference; the 64k-pages version improved --no-mmap loading but hangs on mmap allocations, so I rolled back to the default kernel for now.
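A quick way to confirm which page-size variant is actually running:

```sh
getconf PAGESIZE   # 4096 on the regular kernel, 65536 on the 64k-pages build
```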
On another note, I tried out PyTorch 2.9 with CUDA 13 on NixOS, and am running into a weird issue loading torch CUDA support (which causes torch.cuda.is_available() to return False):
I haven't seen this issue, but what I found out is that for some workflows, PyTorch with CUDA 13 is slower than PyTorch with CUDA 12.9. One example is ComfyUI: I get 96 seconds on cu129 vs 139 seconds on cu130 using the default Qwen Image workflow. Both are still using the installed CUDA 13 SDK. Doesn't seem to make any difference for vllm, though.
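For anyone wanting to compare the two, the builds come from different PyTorch wheel indexes; the index URLs are the standard PyTorch ones, but treat the exact torch version you get as unpinned here:

```sh
# CUDA 12.9 build
pip install torch --index-url https://download.pytorch.org/whl/cu129
# CUDA 13.0 build
pip install torch --index-url https://download.pytorch.org/whl/cu130
# Verify which CUDA toolkit the installed wheel was built against
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```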
Thanks for all the tips on getting Fedora 43 working. I decided to sidestep the Ethernet issue by using a USB Ethernet dongle instead of the built-in Ethernet port.
I decided to install the Fedora Server version and made it headless, with a minimal install, adding packages only as needed. I wanted `nvidia-smi` to show the same output as on the original build, and I got that now:
```
root@fedora:~# uname -a
Linux fedora 6.17.7-300.fc43.aarch64 #1 SMP PREEMPT_DYNAMIC Sun Nov 2 15:33:04 UTC 2025 aarch64 GNU/Linux
root@fedora:~# nvidia-smi
Fri Nov 7 13:41:20 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   40C    P8              5W /  N/A  |          Not Supported |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
For others who want to know the setup I did to get the same config (as root):
Just be aware that the stock Fedora kernel is significantly slower in GPU workloads. I haven't tried 6.17.7 yet, but I don't think any NVIDIA-specific changes made it there. However, Fedora 43 Server with a custom-built kernel performs better than stock DGX OS :)
Thanks for the heads-up on the kernel differences. I wish I could spend more time on this. It definitely feels like a memory access problem. I do have two devices, and I have left one on DGX OS, which makes it easier to figure out the differences. I would be interested to know what they changed to make DGX OS quicker with GPU workloads!