Has anyone tried an alternative Linux distro?

Looks like NixOS is using the r8169 driver, which is likely the issue:

[graham@nixos:~]$ lspci | grep Ethernet
0000:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0007:01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8127 (rev 05)

[graham@nixos:~]$ lspci -k -s 0007:01:00.0
0007:01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8127 (rev 05)
	Subsystem: Realtek Semiconductor Co., Ltd. Device 0123
	Kernel driver in use: r8169
	Kernel modules: r8169

It doesn’t look like my kernel has it:

[graham@nixos:/tmp]$ ls linux-6.17.5/drivers/net/ethernet/realtek/
8139cp.c   atp.c  Kconfig   r8169_firmware.c  r8169.h       r8169_main.c        rtase
8139too.c  atp.h  Makefile  r8169_firmware.h  r8169_leds.c  r8169_phy_config.c

But the NVIDIA kernel referenced above does have it, via an NVIDIA commit.

I’ll try to build the NVIDIA kernel this weekend. Thanks!

I think you nailed it. No module for r8127, it’s using r8169 on Fedora.
I guess I’ll try to download 8127 from realtek and blacklist 8169.

Well, I did it!

Compiled this branch: GitHub - NVIDIA/NV-Kernels at 24.04_linux-nvidia-6.17-next
It’s based on 6.17.1, which lags behind the latest stable release in the 6.17 branch, but not as far behind as 6.11 was.

I reused the config from DGX OS, but set a few parameters:

  • Scheduler controlled preemption model (PREEMPT_LAZY) (NEW)
  • Module versioning implementation: genksyms
  • No per-page mapcount (EXPERIMENTAL) (NO_PAGE_MAPCOUNT) [N/y/?] (NEW) - SET TO YES!
    • This looks like the most important option, as it improves large memory copy operations.
  • Enable r8127 as a module (disabled by default)
  • CPU Power Management / CPU Frequency scaling / Default CPUFreq governor: schedutil (CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y), so the CPU frequency scales down when idle
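If it helps, you can sanity-check that these options ended up in the final .config before building (a quick sketch; run from the kernel source tree after menuconfig or olddefconfig):

```shell
# Confirm the options above are set as intended in the generated .config
grep -E 'CONFIG_PREEMPT_LAZY|CONFIG_NO_PAGE_MAPCOUNT|CONFIG_R8127|CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL' .config
```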

Left the rest at defaults. I think I turned on a few options related to Mellanox and some ARM-related ones, but I don’t remember which.

If not using Ubuntu, you need to unset CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS (or set them to new values).
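A minimal sketch of clearing those using the kernel’s own scripts/config helper (assuming you’re in the kernel source tree with your .config in place):

```shell
# Clear Ubuntu's signing/revocation certificate paths so the build
# doesn't look for keys that only exist on Ubuntu build hosts
./scripts/config --set-str SYSTEM_TRUSTED_KEYS ""
./scripts/config --set-str SYSTEM_REVOCATION_KEYS ""
make olddefconfig   # re-resolve config dependencies after the change
```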

Either disable r8169 in the config, or blacklist it in /etc/modprobe.d/blacklist.conf:

blacklist r8169

Otherwise it will load by default.
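For the modprobe.d route, something like this should work (the update-initramfs step is Ubuntu/Debian-style; adjust for your distro, e.g. dracut):

```shell
# Prevent r8169 from binding to the Realtek 8127 NIC
echo "blacklist r8169" | sudo tee -a /etc/modprobe.d/blacklist.conf
# Rebuild the initramfs so the blacklist applies at early boot
sudo update-initramfs -u
```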

Compile and install the kernel, then sudo poweroff and power back on (a full power cycle avoids an Ethernet adapter malfunction).
Check that r8127 is loaded via lspci -k | grep r8127

Install NVIDIA drivers.
Reboot. Note that Ethernet is not glitching anymore!

Make sure nvidia-persistenced is enabled, so power management works properly.
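On a systemd-based distro, that’s roughly (a sketch; unit name assumed to be the standard one shipped with the driver):

```shell
# Enable and start the NVIDIA persistence daemon, then confirm it's running
sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced --no-pager
```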

So, with all that, I’m getting the same benchmarks in llama.cpp that I was getting in DGX OS, but the model loading is now 4-5x faster! 20 seconds vs more than a minute for gpt-oss-120b!

Benchmarks:

| model                  |      size |   params | backend | test             |            t/s |
| ---------------------- | --------: | -------: | ------- | ---------------- | -------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048           | 1956.03 ± 9.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32             |   60.57 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d4096   | 1637.34 ± 4.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d4096     |   54.14 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d8192   | 1512.01 ± 5.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d8192     |   51.54 ± 0.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d16384  | 1307.42 ± 3.79 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d16384    |   47.45 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | pp2048 @ d32768  | 1027.31 ± 4.79 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA    | tg32 @ d32768    |   40.55 ± 0.13 |

Thanks, this is working well on NixOS. The network interface stays up after a reboot. I found I didn’t need to change most of the options you mentioned, since the NVIDIA-provided defconfig seemed reasonable. The only ones I overrode were:

CONFIG_FAULT_INJECTION = lib.mkForce no; # fault injection might add overhead; don't think I need it
CONFIG_SECURITY_APPARMOR_RESTRICT_USERNS = yes; # NixOS enables AppArmor by default
CONFIG_UBUNTU_HOST = no; # Not Ubuntu!

Here’s the config if anyone’s interested: nixos-dgx-spark/modules/dgx-spark.nix at main · graham33/nixos-dgx-spark · GitHub


@eugr what benchmark are you using above exactly? I’d be interested to see if I can reproduce your numbers. Thanks.

Sure, here we go (first, build llama.cpp from source):

To download the model from Hugging Face and test model loading time:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -c 0 --jinja --reasoning-format auto --no-mmap

Then run benchmark:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

So, 6.17.1 solved the model loading issues without mmap, but mmap performance is still mediocre :( Not a problem with llama.cpp, but absolutely a problem with PyTorch-based workflows, like vLLM.

You can speed it up by using --safetensors-load-strategy eager, but then the vLLM process takes a lot of additional CPU memory (around 1/4 of the overall model size). For Qwen3-Next-80B in FP8 quant, that’s an additional 20 GB of RAM!
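For reference, the trade-off looks roughly like this (the model name here is just a placeholder; the flag is the one mentioned above):

```shell
# Eager safetensors loading: faster model load, but the process holds
# extra CPU RAM (~1/4 of model size) on top of the usual footprint
vllm serve Qwen/Qwen3-Next-80B-FP8 --safetensors-load-strategy eager
```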

I created a separate post about it - want to know if NVIDIA is aware and whether there are any plans on fixing this.

I’m not familiar with NixOS kernel build process, but how does r8127 get included in your kernel if the config option is not set in arm64 defconfig? Am I missing something?

@eugr please see debian.nvidia-6.17/config/annotations for details:

CONFIG_R8127 policy<{'amd64': 'n', 'arm64': 'm'}>

It builds the module if ARM is detected.
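You can confirm this from an NV-Kernels checkout (path as given above):

```shell
# Show the per-arch policy for the r8127 module in the Ubuntu-style annotations
grep -n -A2 "CONFIG_R8127" debian.nvidia-6.17/config/annotations
```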

So, Nix can use the Debian annotations to build the kernel?

It definitely doesn’t work when using the standard Makefile. I based my .config on the one from the DGX OS that came with the Spark, so it works; but when I tried to generate one using defconfig, the result didn’t even boot.