Has anyone tried an alternative Linux distro?

Looks like NixOS is using the r8169 driver, which is likely the issue:

[graham@nixos:~]$ lspci | grep Ethernet
0000:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0007:01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8127 (rev 05)

[graham@nixos:~]$ lspci -k -s 0007:01:00.0
0007:01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8127 (rev 05)
	Subsystem: Realtek Semiconductor Co., Ltd. Device 0123
	Kernel driver in use: r8169
	Kernel modules: r8169

It doesn’t look like my kernel has an r8127 driver:

[graham@nixos:/tmp]$ ls linux-6.17.5/drivers/net/ethernet/realtek/
8139cp.c   atp.c  Kconfig   r8169_firmware.c  r8169.h       r8169_main.c        rtase
8139too.c  atp.h  Makefile  r8169_firmware.h  r8169_leds.c  r8169_phy_config.c

But the NVIDIA kernel referenced above does have it, added via an NVIDIA commit.

I’ll try to build the NVIDIA kernel this weekend. Thanks!

I think you nailed it. There is no module for r8127; it’s using r8169 on Fedora.
I guess I’ll try to download the r8127 driver from Realtek and blacklist r8169.

Well, I did it!

Compiled this branch: GitHub - NVIDIA/NV-Kernels at 24.04_linux-nvidia-6.17-next
It’s based on 6.17.1, which lags behind the latest stable release in the 6.17 branch, but not as badly as 6.11 does.
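
That is, something along these lines (the repo URL is inferred from the branch reference above):

git clone -b 24.04_linux-nvidia-6.17-next https://github.com/NVIDIA/NV-Kernels.git
cd NV-Kernels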

I reused the config from DGX OS, but changed a few parameters (see the scripts/config sketch below):

  • Scheduler controlled preemption model (PREEMPT_LAZY) (NEW)
  • Module versioning implementation: genksyms
  • No per-page mapcount (EXPERIMENTAL) (NO_PAGE_MAPCOUNT) [N/y/?] (NEW) - SET TO YES!!!
    • This looks like the most important option, as it improves large memory copy operations.
  • Enable r8127 as a module (disabled by default)
  • CPU Power Management / CPU Frequency scaling / Default CPUFreq governor: schedutil (CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y), so the CPU frequency scales down when idle
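
For anyone who prefers to script the same changes, here is a rough equivalent using the kernel's scripts/config helper, run in the kernel source tree against the DGX OS .config. The symbol names come from the prompts above, except the module-versioning one (MODVERSIONS), which is my assumption; finish with make olddefconfig so dependencies get resolved.

scripts/config --enable PREEMPT_LAZY                    # scheduler-controlled (lazy) preemption
scripts/config --enable MODVERSIONS                     # module versioning (genksyms); assumed symbol
scripts/config --enable NO_PAGE_MAPCOUNT                # "No per-page mapcount (EXPERIMENTAL)"
scripts/config --module R8127                           # build the Realtek r8127 driver as a module
scripts/config --enable CPU_FREQ_DEFAULT_GOV_SCHEDUTIL  # default CPUFreq governor: schedutil
make olddefconfig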

Left the rest as defaults. I think I turned on a few options related to the Mellanox NICs and some ARM-related ones, but I don’t remember which.

You need to unset these if you’re not building on Ubuntu (or set them to new values): CONFIG_SYSTEM_TRUSTED_KEYS, CONFIG_SYSTEM_REVOCATION_KEYS.
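
One way to clear them is with scripts/config (or just edit .config directly):

scripts/config --set-str SYSTEM_TRUSTED_KEYS ""
scripts/config --set-str SYSTEM_REVOCATION_KEYS ""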

Either disable r8169 in the config, or blacklist it in /etc/modprobe.d/blacklist.conf:

blacklist r8169

Otherwise it will load by default.
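
If r8169 is also being loaded from the initramfs, the blacklist has to be baked in there too; regenerating it is distro-specific, for example:

sudo update-initramfs -u   # Debian/Ubuntu
sudo dracut --force        # Fedora/RHEL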

Compile and install the kernel, then sudo poweroff and power back on (a full power cycle avoids the Ethernet adapter malfunctioning).
Check that r8127 is loaded via lspci -k | grep r8127
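
For reference, the build/install/verify sequence is the usual one, roughly:

make -j"$(nproc)"
sudo make modules_install
sudo make install
sudo poweroff
# after powering back on:
lspci -k | grep r8127   # expect "Kernel driver in use: r8127"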

Install NVIDIA drivers.
Reboot. Note that Ethernet is not glitching anymore!

Make sure nvidia-persistenced is enabled, so power management works properly.
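
On a systemd-based install that’s typically:

sudo systemctl enable --now nvidia-persistenced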

So, with all that, I’m getting the same benchmarks in llama.cpp that I was getting in DGX OS, but the model loading is now 4-5x faster! 20 seconds vs more than a minute for gpt-oss-120b!

Benchmarks:

| model                  |      size |   params | backend | test            |             t/s |
| ---------------------- | --------- | -------- | ------- | --------------- | --------------- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048          |  1956.03 ± 9.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32            |    60.57 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d4096  |  1637.34 ± 4.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d4096    |    54.14 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d8192  |  1512.01 ± 5.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d8192    |    51.54 ± 0.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d16384 |  1307.42 ± 3.79 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d16384   |    47.45 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d32768 |  1027.31 ± 4.79 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d32768   |    40.55 ± 0.13 |

Thanks, this is working well on NixOS. The network interface stays up after a reboot. I found I didn’t need to change most of the options you mentioned, since the NVIDIA-provided defconfig seemed reasonable. The only ones I overrode were:

CONFIG_FAULT_INJECTION = lib.mkForce no; # fault injection might add overhead/don't think I need it?
CONFIG_SECURITY_APPARMOR_RESTRICT_USERNS = yes; # NixOS enables AppArmor by default
CONFIG_UBUNTU_HOST = no; # Not Ubuntu!

Here’s the config if anyone’s interested: nixos-dgx-spark/modules/dgx-spark.nix at main · graham33/nixos-dgx-spark · GitHub


@eugr what benchmark are you using above exactly? I’d be interested to see if I can reproduce your numbers. Thanks.

Sure, here we go (first, build llama.cpp from source):
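
A typical CUDA build looks something like this (these are the standard llama.cpp CUDA build steps; the exact flags used here may have differed):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j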

To download the model from Hugging Face and test the model loading time:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 --jinja --reasoning-format auto --no-mmap

Then run benchmark:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

So, 6.17.1 solved the model loading issues without mmap, but mmap performance is still mediocre :( It's not a problem for llama.cpp, but it's absolutely a problem for PyTorch-based workflows like vLLM.

Now you can speed it up by using --safetensors-load-strategy eager, but then the vLLM process takes a lot of additional CPU memory (roughly 1/4 of the overall model size). For Qwen3-Next-80B in FP8 quant, that's an additional 20 GB of RAM!
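
For example, something like this (the model path is just an illustration; the flag is the one mentioned above):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --safetensors-load-strategy eager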

I created a separate post about it; I'd like to know whether NVIDIA is aware of this and whether there are any plans to fix it.

I’m not familiar with the NixOS kernel build process, but how does r8127 get included in your kernel if the config option isn't set in the arm64 defconfig? Am I missing something?

@eugr please see debian.nvidia-6.17/config/annotations for details:

CONFIG_R8127 policy<{'amd64': 'n', 'arm64': 'm'}>

It builds the module when the target architecture is arm64.

So, Nix can use the Debian annotations to build the kernel?

It definitely doesn’t work when using the standard Makefile. I based my .config on the one from the DGX OS install that came with the Spark, so it works; but when I tried to generate one using defconfig, it didn’t even boot.

Sorry for the late response. In this case I’m telling Nix to use defconfig, which ends up doing a make defconfig in the kernel source, which pulls in `arch/$(ARCH)/configs/defconfig`. You’re right that it doesn’t have CONFIG_R8127=m, so I assume this must be implied by another option.
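
One quick way to check what actually ended up in the running kernel (assuming CONFIG_IKCONFIG_PROC is enabled; otherwise grep the generated .config) is:

zcat /proc/config.gz | grep -E 'R8127|R8169'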

BTW, I generated a .config using Debian’s annotations script, which picked up the NVIDIA-specific annotations. It differs from mine in some options, but works OK. I just needed to replace AppArmor with SELinux, which Fedora uses by default. I also changed the default CPU governor to schedutil, so it ramps down frequencies at idle.
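
For anyone wanting to reproduce that, the annotations tool lives under debian/scripts/misc in the Ubuntu/NVIDIA kernel tree; an invocation along these lines should export an arm64 config (flags are from memory, so double-check --help):

./debian/scripts/misc/annotations --arch arm64 --export > .config
make olddefconfig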

Turns out the NO_PAGE_MAPCOUNT option doesn’t really affect performance. What does is hugepage support, but that’s on in the NVIDIA annotations too.

Out of curiosity, I tried the 6.14 kernel from the DGX OS packages. The regular version didn’t make any difference; the 64k-pages version improved --no-mmap loading, but it hangs on mmap allocations, so I rolled back to the default kernel for now.


On another note, I tried out PyTorch 2.9 with CUDA 13 on NixOS, and am running into a weird issue loading torch CUDA support (which causes torch.cuda.is_available() to return False):

grep -i -A 10 -B 10 error /tmp/ld.out

    271892:     calling init: /nix/store/ijci9ppcf6sfnhah49q75ncqjsmi7ngz-cuda13.0-libcublas-13.1.0.3-lib/lib/libcublas.so.13

    271892:     calling init: /nix/store/z3di9xkjrk95sv8fjryk767767vdqh3h-cuda13.0-libcurand-10.4.0.35-lib/lib/libcurand.so.10

    271892:     calling init: /nix/store/61skgh54ddpqz6xnpwg3sq81fs9k9qj8-cuda13.0-libcusparse_lt-0.8.1.1-lib/lib/libcusparseLt.so.0

    271892:     /nix/store/m0b67b3lmjcxa8aplpl75qpb26gr5vsf-python3-3.13.8/bin/python: error: symbol lookup error: undefined symbol: InitializeInjectionCaskEnhanced (fatal)

    271892:     calling init: /nix/store/xwxym0y0az7sd48x3gkxi9kl4nw28c3k-cuda13.0-libcufft-12.0.0.61-lib/lib/libcufft.so.12

    271892:     calling init: /nix/store/8fjxxc3jnb16x4ka014gda3d31rjyviz-cuda13.0-libcusparse-12.6.3.3-lib/lib/libcusparse.so.12

    271892:     calling init: /nix/store/zmjqwzhgl9hwr8xnk8raj8d50lzkql66-gcc-14.3.0-lib/lib/libgomp.so.1

Just posting here on the off-chance that someone else has seen it!

I haven’t seen this issue, but what I have found is that for some workflows, PyTorch with CUDA 13 is slower than PyTorch with CUDA 12.9. One example is ComfyUI: I get 96 seconds on cu129 vs 139 seconds on cu130 with the default Qwen Image workflow. Both are still using the installed CUDA 13 SDK. It doesn’t seem to make any difference for vLLM, though.
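
For anyone trying to reproduce the cu129 vs cu130 comparison, the two PyTorch builds would normally come from the standard wheel indexes (URLs assumed to be the usual ones; pick torch versions to match your setup):

pip install torch --index-url https://download.pytorch.org/whl/cu129
pip install torch --index-url https://download.pytorch.org/whl/cu130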

Thanks for all the tips on getting Fedora 43 working. I decided to sidestep the Ethernet issue by using a USB Ethernet dongle instead of the built-in Ethernet port.

I decided to install the Fedora Server version and made it headless, with a minimal install, adding packages only as needed. I wanted "nvidia-smi" to show the same output as on the original build, and I've got that now:

root@fedora:~# uname -a
Linux fedora 6.17.7-300.fc43.aarch64 #1 SMP PREEMPT_DYNAMIC Sun Nov  2 15:33:04 UTC 2025 aarch64 GNU/Linux
root@fedora:~# nvidia-smi
Fri Nov  7 13:41:20 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   40C    P8              5W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

For others who want to reproduce the same config, here's the setup I did (as root):

nvidia-smi -pm 1
systemctl enable --now nvidia-persistenced
grubby --update-kernel=ALL --args="nvidia-drm.modeset=0"
reboot

Just be aware that the stock Fedora kernel is significantly slower in GPU workloads. I haven’t tried 6.17.7 yet, but I don’t think any NVIDIA-specific changes made it in there. However, Fedora 43 Server with a custom-built kernel performs better than stock DGX OS :)


Thanks for the heads-up on the kernel differences. I wish I could spend more time on this. It definitely feels like a memory access problem. I do have two devices, and I've left one on DGX OS, which makes it easier to figure out the differences. I would be interested to know what they changed to make DGX OS quicker with GPU workloads!