I reused the config from DGX OS, but changed a few parameters:
Scheduler controlled preemption model (PREEMPT_LAZY) (NEW)
Module versioning implementation: genksyms
No per-page mapcount (EXPERIMENTAL) (NO_PAGE_MAPCOUNT) [N/y/?] (NEW) - SET TO YES!!!
This looks like the most important option, as it improves large memory copy operations.
Enable r8127 as a module (it is disabled by default).
CPU Power Management / CPU Frequency scaling / Default CPUFreq governor: schedutil (CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y), so the CPU frequency scales down when idle.
Left the rest as default. I think I turned on a few options related to Mellanox and some ARM-related ones, but I don't remember which.
If you are not building on Ubuntu, you need to unset CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS (or set them to new values); see the sketch below.
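For reference, a minimal sketch of setting all of the above with the kernel's scripts/config helper; the option names are the ones mentioned in this list, so verify them against your source tree before building:

```sh
# Run from the kernel source tree, after copying in the base .config.
./scripts/config --enable  PREEMPT_LAZY
./scripts/config --enable  NO_PAGE_MAPCOUNT
./scripts/config --module  R8127
./scripts/config --enable  CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
./scripts/config --disable R8169
# Required when not building on Ubuntu:
./scripts/config --set-str SYSTEM_TRUSTED_KEYS ""
./scripts/config --set-str SYSTEM_REVOCATION_KEYS ""
make olddefconfig   # resolve dependencies after the edits
```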
Either disable r8169 in the config, or blacklist it in /etc/modprobe.d/blacklist.conf:

```
blacklist r8169
```

Otherwise it will load by default.
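If r8169 is included in your initramfs, you also need to rebuild the initramfs so the blacklist applies during early boot; this step is an assumption on my part and the rebuild command is distro-specific:

```sh
echo "blacklist r8169" | sudo tee -a /etc/modprobe.d/blacklist.conf
# Rebuild the initramfs so the blacklist takes effect at boot:
sudo update-initramfs -u   # Debian/Ubuntu
# sudo dracut --force      # Fedora
```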
Compile and install the kernel, then sudo poweroff and power back on; a full power cycle (not just a reboot) avoids the Ethernet adapter malfunction.
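The build and install steps are the standard in-tree flow; a minimal sketch, assuming an in-tree (non-packaged) build:

```sh
make -j"$(nproc)"            # build kernel and modules
sudo make modules_install    # install modules to /lib/modules/<version>
sudo make install            # install the kernel image and update the bootloader
sudo poweroff                # full power cycle, not just a reboot
```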
Check that r8127 is loaded via `lspci -k | grep r8127`.
Install NVIDIA drivers.
Reboot. Note that Ethernet is not glitching anymore!
Make sure nvidia-persistenced is enabled, so power management works properly.
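On a systemd-based distro that is typically:

```sh
sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced   # should report active (running)
```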
So, with all that, I'm getting the same benchmarks in llama.cpp that I was getting on DGX OS, but model loading is now 4-5x faster: 20 seconds vs. more than a minute for gpt-oss-120b!
Thanks, this is working well on NixOS. The network interface stays up after a reboot. I found I didn't need to change most of the options you mentioned, since the NVIDIA-provided defconfig seemed reasonable. The only ones I overrode were:
```nix
CONFIG_FAULT_INJECTION = lib.mkForce no;          # fault injection might add overhead; don't think I need it?
CONFIG_SECURITY_APPARMOR_RESTRICT_USERNS = yes;   # NixOS enables AppArmor by default
CONFIG_UBUNTU_HOST = no;                          # Not Ubuntu!
```
So, 6.17.1 solved the model loading issues without mmap, but mmap performance is still mediocre :( Not a problem with llama.cpp, but absolutely a problem with PyTorch-based workflows, like vllm.
Now you can speed it up by using --safetensors-load-strategy eager, but then the vllm process takes a lot of additional CPU memory (about 1/4 of the overall model size). For Qwen3-Next-80B in FP8 quant, that's an additional 20GB of RAM!
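For reference, this is roughly how the flag is passed; the model ID below is only an illustrative assumption, and the flag spelling should be checked against `vllm serve --help` for your version:

```sh
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --safetensors-load-strategy eager   # faster load, at the cost of extra CPU RAM
```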
I created a separate post about it; I want to know whether NVIDIA is aware and whether there are any plans to fix this.
I'm not familiar with the NixOS kernel build process, but how does r8127 get included in your kernel if the config option is not set in the arm64 defconfig? Am I missing something?
So, Nix can use Debian annotations to build the kernel?
It definitely doesn't work when using the standard Makefile. I based my .config on the one from the DGX OS that came with the Spark, so it works; but when I tried to generate one using defconfig, it didn't even boot.
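For anyone reproducing this, a sketch of the approach that worked, starting from the shipped DGX OS config instead of defconfig (the /boot path is an assumption; /proc/config.gz works too if IKCONFIG_PROC is enabled):

```sh
# Start from the running DGX OS kernel config rather than the arm64 defconfig
cp /boot/config-"$(uname -r)" .config
make olddefconfig    # fill in defaults for any options new to this kernel
make menuconfig      # then adjust the options listed earlier in the thread
```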
Sorry for the late response. In this case I'm telling Nix to use defconfig, which ends up doing a make defconfig in the kernel source, which pulls in `arch/$(ARCH)/configs/defconfig`. You're right that it doesn't have CONFIG_R8127=m, so I assume it must be implied by another option.
BTW, I generated a .config using Debian's annotations script, which picked up the NVIDIA-specific annotations. It differs from mine in some options, but works OK. I just needed to replace AppArmor with Fedora's default (SELinux). I also changed the default CPU governor to schedutil, so it ramps down frequencies at idle.
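Roughly what I mean, as a sketch; the script path and the flavour name are assumptions about the Ubuntu/NVIDIA kernel tree, so check what your source actually ships:

```sh
# From the Ubuntu/NVIDIA kernel source tree: export a .config from the
# annotations database (path and flavour are assumptions for this tree)
./debian/scripts/misc/annotations --arch arm64 --flavour nvidia --export > .config
make olddefconfig
```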
Turns out the NO_PAGE_MAPCOUNT option doesn't really affect performance. What does is hugepage support, but that's on in the NVIDIA annotations too.
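You can check that transparent hugepages are enabled on the running kernel via sysfs:

```sh
cat /sys/kernel/mm/transparent_hugepage/enabled
# e.g. [always] madvise never  -- the bracketed value is the active mode
```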
Out of curiosity, I tried the 6.14 kernel from the DGX OS packages. The regular version didn't make any difference; the 64k-pages version improved --no-mmap loading but hangs on mmap allocations, so I rolled back to the default kernel for now.
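A quick way to confirm which page-size variant is actually running:

```sh
getconf PAGESIZE   # 4096 on the regular kernel, 65536 on the 64k-pages build
```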
On another note, I tried out PyTorch 2.9 with CUDA 13 on NixOS, and am running into a weird issue loading torch CUDA support (which causes torch.cuda.is_available() to return False):
I haven't seen this issue, but what I found out is that for some workflows, PyTorch with CUDA 13 is slower than PyTorch with CUDA 12.9. One example is ComfyUI: I get 96 seconds on cu129 vs 139 seconds on cu130 using the default Qwen Image workflow. Both are still using the installed CUDA 13 SDK. Doesn't seem to make any difference for vllm, though.
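For anyone wanting to compare the two, the builds come from different PyTorch wheel indexes; the index URLs are the standard PyTorch ones, but treat the exact torch version you get as unpinned here:

```sh
# CUDA 12.9 build
pip install torch --index-url https://download.pytorch.org/whl/cu129
# CUDA 13.0 build
pip install torch --index-url https://download.pytorch.org/whl/cu130
# Verify which CUDA toolkit the installed wheel was built against
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```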
Thanks for all the tips on getting Fedora 43 working. I decided to sidestep the Ethernet issue by using a USB Ethernet dongle instead of the built-in Ethernet port.
I decided to install the Fedora Server version and made it headless, with a minimal install, adding packages only as needed. I wanted `nvidia-smi` to show the same output as on the original build, and I got that now:
```
root@fedora:~# uname -a
Linux fedora 6.17.7-300.fc43.aarch64 #1 SMP PREEMPT_DYNAMIC Sun Nov 2 15:33:04 UTC 2025 aarch64 GNU/Linux
root@fedora:~# nvidia-smi
Fri Nov 7 13:41:20 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   40C    P8              5W /  N/A  |          Not Supported |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
For others who want to know the setup I did to get the same config (as root):
Just be aware that the stock Fedora kernel is significantly slower in GPU workloads. I haven't tried 6.17.7 yet, but I don't think any NVIDIA-specific changes made it there. However, Fedora 43 Server with a custom-built kernel performs better than stock DGX OS :)
Thanks for the heads-up on the kernel differences. I wish I could spend more time on this. It definitely feels like a memory access problem. I do have two devices, and I have left one on DGX OS, which makes it easier to figure out the differences. I would be interested to know what they changed to make DGX OS quicker with GPU workloads!