DGX Spark Aerial installer: DOCA/MLNX OFED DKMS fails when both 6.17.0-1014-nvidia and 6.17.0-1018-nvidia kernels are installed

Hi NVIDIA Aerial team,

We are setting up NVIDIA Aerial CUDA-Accelerated RAN on DGX Spark.

During the driver installation, we observed an issue related to the installed kernel versions. The system has both of the following NVIDIA kernels installed:

  • 6.17.0-1014-nvidia
  • 6.17.0-1018-nvidia

When running:

./install_drivers.sh

the installer detects broken DOCA/MLNX OFED packages and runs dpkg/apt recovery. The mlnx-ofed-kernel-dkms package then attempts to build for both installed kernels:

Building for 6.17.0-1014-nvidia and 6.17.0-1018-nvidia

The build for 6.17.0-1014-nvidia succeeds. However, the build for 6.17.0-1018-nvidia fails:

Building initial module mlnx-ofed-kernel/25.10.OFED.25.10.1.7.1.1 for 6.17.0-1018-nvidia

Building module(s) … (bad exit status: 2)
Failed command:
make -j4 KERNELRELEASE=6.17.0-1018-nvidia

Error! Bad return status for module build on kernel: 6.17.0-1018-nvidia (aarch64)

After that, the following packages remain in a broken or unconfigured state:

  • mlnx-ofed-kernel-dkms
  • iser-dkms
  • isert-dkms
  • srp-dkms

This causes dpkg --configure -a and apt --fix-broken install -y to repeatedly retry the same DKMS build sequence. In practice, it becomes a loop:

  1. DKMS builds successfully for 6.17.0-1014-nvidia
  2. DKMS then tries to build for 6.17.0-1018-nvidia
  3. The 6.17.0-1018-nvidia build fails
  4. dpkg remains in a broken state
  5. apt --fix-broken install retries the same process
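
To confirm which kernels are involved before retrying, a minimal sketch along the following lines may help. It assumes each installed kernel has a directory under /lib/modules (the usual Ubuntu layout). The helper name list_installed_kernels is hypothetical and not part of the Aerial installer.

```shell
#!/bin/sh
# Sketch: list the kernel releases that have a module tree on this system.
# Assumption: every installed kernel owns a directory under /lib/modules.
# list_installed_kernels is a hypothetical helper name, not part of the installer.
list_installed_kernels() {
    modules_dir="${1:-/lib/modules}"
    for dir in "$modules_dir"/*/; do
        # Skip the unexpanded glob when the directory has no subdirectories.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

# Typical use on the affected machine:
#   uname -r                 # kernel currently running
#   list_installed_kernels   # every kernel with a module tree
#   dkms status              # per-kernel build state of each DKMS module
```

If both 6.17.0-1014-nvidia and 6.17.0-1018-nvidia show up, that matches the two build targets seen in the DKMS log above.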

The Aerial environment appears to expect 6.17.0-1014-nvidia, and the installer dependency check also reports linux-headers-6.17.0-1014-nvidia as fulfilled.

Could you please clarify the recommended kernel baseline for DGX Spark with the current Aerial CUDA-Accelerated RAN release?

Specifically:

  1. Should DGX Spark continue to use 6.17.0-1014-nvidia for Aerial at this stage?
  2. Is 6.17.0-1018-nvidia currently supported by the DOCA/MLNX OFED package used by Aerial?
  3. If 6.17.0-1018-nvidia is not supported yet, should we remove/purge the 1018 kernel and keep only 6.17.0-1014-nvidia?
  4. Do you recommend pinning or holding the 6.17.0-1014-nvidia kernel to prevent automatic upgrade to 6.17.0-1018-nvidia?
  5. Will a future Aerial/DOCA release officially support 6.17.0-1018-nvidia on DGX Spark?

For reference, the failing package version is:

mlnx-ofed-kernel-dkms 25.10.OFED.25.10.1.7.1.1-1

The observed failure directly affects nvidia.service as well, because nvidia-peermem cannot be loaded when the DOCA/OFED stack is not configured successfully.

Any guidance on the supported kernel version and recommended recovery procedure would be appreciated.

Thanks.

Hi @xudong.zhao

Yes, please use 6.17.0-1014-nvidia for Aerial-testbed at this stage. We haven’t tested it with 6.17.0-1018-nvidia yet.

Please disable automatic kernel upgrades so that all the software and drivers built against 6.17.0-1014-nvidia keep working with this -1014-nvidia kernel.
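
One possible way to do that is to hold the kernel packages with apt-mark. The sketch below only prints the commands so they can be reviewed first. The linux-headers package name appears earlier in this thread; the linux-image and linux-modules names are assumptions based on the usual Ubuntu naming and should be checked with dpkg -l. The helper name print_hold_commands is hypothetical.

```shell
#!/bin/sh
# Sketch: print (do not run) the apt-mark commands that would hold one kernel.
# Assumptions: linux-headers-<release> is named in this thread; the
# linux-image-<release> and linux-modules-<release> names follow the usual
# Ubuntu naming and should be verified with `dpkg -l` before use.
# print_hold_commands is a hypothetical helper name.
print_hold_commands() {
    release="$1"
    for prefix in linux-image linux-headers linux-modules; do
        printf 'sudo apt-mark hold %s-%s\n' "$prefix" "$release"
    done
}

# Print the hold commands for the kernel recommended above.
print_hold_commands "6.17.0-1014-nvidia"
```

After running the printed commands, apt-mark showhold lists what is held. The meta-package that pulls in newer kernels probably needs a hold as well; its name is not given in this thread, so it is left out here.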

Thanks!