DGX Spark Mini – ConnectX‑7 QSFP Ports Not Powering On + Diagnostic Tool Unavailable

Hello NVIDIA team,

I am working with two DGX Spark Mini systems and am encountering a persistent issue where the ConnectX‑7 QSFP ports never power on and the nodes cannot establish a link. In addition, the diagnostic tool recommended by NVIDIA support (dgx-spark-fielddiag) cannot be installed because it does not appear in any available repository.

I am requesting engineering guidance on both issues.

1. QSFP Ports Never Power On / No Link Between Nodes

Both DGX Spark Minis show identical behavior:

ConnectX‑7 NIC enumerates correctly in PCIe (lspci shows the device).

mlx5_core loads without errors.

Firmware version is visible.

Cable insertion/removal events appear in dmesg.

QSFP cages never power up.

No network interfaces (p7p1, mlx5_0, etc.) appear in ip link.

Link state remains DOWN at all times.

Key messages repeated in the logs:

Detected insufficient power on the PCIe slot (27W)

QSFP module not powered

Port module: cable unplugged

This occurs even when the cable is fully inserted; the commands used to collect this evidence are shown below.
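
For reference, the evidence above was gathered with commands along these lines (a minimal sketch; exact invocations and grep patterns varied):

lspci -d 15b3:                                   # ConnectX-7 enumeration (Mellanox vendor ID)
sudo dmesg | grep -iE 'mlx5|qsfp|module|power'   # driver load, cable events, power messages
ip -br link                                      # interfaces and link state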

Cable used

Amphenol NJAAKK‑N911 (0.4 m). I understand this is not an NVIDIA‑qualified QSFP112 cable, but even with an unsupported cable the QSFP cage should still power on if PCIe power delivery is correct.

Because both DGX Spark Minis show the same 27W PCIe power limit and identical QSFP behavior, this appears to be a platform‑level PCIe power delivery issue rather than a single faulty NIC.
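
If it helps, the advertised slot power limit can also be read straight from the PCIe capability registers. A rough sketch (the sysfs lookup of the upstream port is an assumption about the topology; BDFs will differ per system):

BDF=$(lspci -D -d 15b3: | awk 'NR==1{print $1}')                           # first ConnectX-7 function
sudo lspci -vv -s "$BDF" | grep -iE 'LnkSta|Power'                         # endpoint link/power status
PORT=$(basename "$(dirname "$(readlink "/sys/bus/pci/devices/$BDF")")")    # downstream port above the NIC
sudo lspci -vv -s "$PORT" | grep -i -A 2 'SltCap'                          # slot capabilities, incl. PowerLimit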

2. Diagnostic Tool Requested by NVIDIA Support Cannot Be Installed

NVIDIA support requested that I run the DGX Spark diagnostic tool (dgx-spark-fielddiag). However, after restoring APT sources and running:

sudo apt update

sudo apt install dgx-spark-fielddiag

APT returns:

E: Unable to locate package dgx-spark-fielddiag

Repository state

The only DGX-related repo present is dgx.sources in /etc/apt/sources.list.d/.

This repo provides packages such as dgx-repo, dgx-spark-mlnx-hotplug, dgx-spark-oobe-customize, etc.

dgx-spark-fielddiag is not present in this repository.

Attempting to use the URL:

https://developer.download.nvidia.com/dgx/repos/spark/ubuntu

results in:

The repository does not have a Release file

and APT disables it.

This suggests the diagnostic tool is part of the DGX Spark OS factory image or a private repository, not the public DGX repo. Since this system no longer has the factory OS image, I cannot install the diagnostic tool required for your troubleshooting workflow.
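
A quick way to confirm what the configured repositories actually provide, and where APT would pull the field diagnostic from, is something along these lines (package name spelled as in the support request):

apt-cache search --names-only '^dgx-'
apt-cache policy dgx-spark-fielddiag
grep -rn 'nvidia' /etc/apt/sources.list.d/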

3. Request for Engineering Guidance

I need assistance with the following:

Can a 27W PCIe power limit on DGX Spark Mini prevent the ConnectX‑7 QSFP112 cage from powering on?

Should the Spark Mini supply the full PCIe power budget required for QSFP112 modules?

Is there a BIOS, firmware, or platform configuration required to enable full PCIe power?

Does this behavior indicate a hardware issue with the DGX Spark Mini motherboard or PCIe power delivery?

How can I obtain the DGX Spark Mini OS recovery image or the private repository that contains dgx-spark-fielddiag so I can run the diagnostics you requested?

I can provide full dmesg, lspci -vv, NIC firmware logs, and system snapshots if needed.

Thank you; I appreciate any guidance from the DGX Spark engineering team in determining whether this is a platform power issue, a firmware issue, or a hardware fault, and how to restore the diagnostic environment.

Please reference steps 1 and 2 at https://nvidia.custhelp.com/app/answers/detail/a_id/5767/~/nvidia-dgx-spark-field-diagnostics to update the appropriate repos.

I followed the steps laid out in that article, and the result is what I have posted below.

dgxspark@spark-dc77:~$ sudo mkdir -p /usr/share/keyrings
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-archive-keyring.gpg |
sudo tee /usr/share/keyrings/cuda-archive-keyring.gpg > /dev/null
[sudo] password for dgxspark:
dgxspark@spark-dc77:~$ echo "deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/ /" |
sudo tee /etc/apt/sources.list.d/cuda-sbsa-ubuntu2404.list
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/ /
dgxspark@spark-dc77:~$ sudo apt update
Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa  InRelease
Hit:2 https://workbench.download.nvidia.com/stable/linux/debian default InRelease
Get:3 Index of /apps/ubuntu/ noble-apps-security InRelease [8,371 B]
Ign:4 https://snapshot.ppa.launchpadcontent.net/canonical-nvidia/linux-firmware-mbssid-patches/ubuntu noble InRelease
Hit:5 http://ports.ubuntu.com/ubuntu-ports noble InRelease
Get:6 Index of /apps/ubuntu/ noble-apps-updates InRelease [8,220 B]
Get:8 http://ports.ubuntu.com/ubuntu-ports noble-updates InRelease [126 kB]
Get:9 Index of /infra/ubuntu/ noble-infra-security InRelease [8,214 B]
Get:10 Index of /infra/ubuntu/ noble-infra-updates InRelease [8,213 B]
Get:11 Index of /apps/ubuntu/ noble-apps-security/main arm64 Packages [279 kB]
Ign:7 https://developer.download.nvidia.com/dgx/repos/spark/ubuntu noble InRelease
Hit:12 http://ports.ubuntu.com/ubuntu-ports noble-backports InRelease
Get:14 http://ports.ubuntu.com/ubuntu-ports noble-security InRelease [126 kB]
Get:15 http://ports.ubuntu.com/ubuntu-ports noble-updates/main arm64 Packages [1,915 kB]
Get:16 http://ports.ubuntu.com/ubuntu-ports noble-updates/main Translation-en [333 kB]
Get:17 http://ports.ubuntu.com/ubuntu-ports noble-updates/universe arm64 Packages [1,528 kB]
Get:18 http://ports.ubuntu.com/ubuntu-ports noble-updates/universe Translation-en [319 kB]
Get:19 http://ports.ubuntu.com/ubuntu-ports noble-security/main arm64 Packages [1,628 kB]
Get:20 http://ports.ubuntu.com/ubuntu-ports noble-security/universe arm64 c-n-f Metadata [20.2 kB]
Err:13 https://developer.download.nvidia.com/dgx/repos/spark/ubuntu noble Release
404 Not Found [IP: 23.62.33.28 443]
Hit:21 https://snapshot.ppa.launchpadcontent.net/canonical-nvidia/nvidia-desktop-edge/ubuntu noble InRelease
Ign:4 https://snapshot.ppa.launchpadcontent.net/canonical-nvidia/linux-firmware-mbssid-patches/ubuntu noble InRelease
Ign:4 https://snapshot.ppa.launchpadcontent.net/canonical-nvidia/linux-firmware-mbssid-patches/ubuntu noble InRelease
Hit:4 https://snapshot.ppa.launchpadcontent.net/canonical-nvidia/linux-firmware-mbssid-patches/ubuntu noble InRelease
Reading package lists… Done
E: The repository ‘https://developer.download.nvidia.com/dgx/repos/spark/ubuntu noble Release’ does not have a Release file.
N: Updating from such a repository can’t be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
dgxspark@spark-dc77:~$ sudo apt install dgx-spark-fieldiag
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
dgx-spark-fieldiag is already the newest version (1.0.9-1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
dgxspark@spark-dc77:~$ dpkg -l | grep dgx-spark-fieldiag
ii dgx-spark-fieldiag 1.0.9-1 arm64 Nvidia DGX Spark fieldiag tool
dgxspark@spark-dc77:~$ sudo dgx-spark-fieldiag
sudo: dgx-spark-fieldiag: command not found
dgxspark@spark-dc77:~$

@vamsidhar.r.vurimindi2021 try /opt/nvidia/dgx-spark-fieldiag/partnerdiag

sudo init 3
cd /opt/nvidia/dgx-spark-fieldiag
sudo ./partnerdiag --field

The result is the same: it just opens the TTY1 screen, shows system info, and does not run the diagnostics.

Hi @vamsidhar.r.vurimindi2021, it looks like you haven't properly configured the package source needed to download the fieldiag.
Can you share with me the contents of these two files? One of them should have a link to https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

cat /etc/apt/sources.list.d/cuda-compute-repo.sources
cat /etc/apt/sources.list.d/cuda-sbsa-ubuntu2404.list

dgxspark@spark-dc77:~$ sudo apt --reinstall install dgx-spark-fieldiag
[sudo] password for dgxspark:
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 1 not upgraded.
Need to get 807 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa  dgx-spark-fieldiag 1.0.9-1 [807 MB]
Fetched 807 MB in 57s (14.1 MB/s)
(Reading database … 260888 files and directories currently installed.)
Preparing to unpack …/dgx-spark-fieldiag_1.0.9-1_arm64.deb …
INFO: DGX Spark system detected. Proceeding with installation.
Unpacking dgx-spark-fieldiag (1.0.9-1) over (1.0.9-1) …
Setting up dgx-spark-fieldiag (1.0.9-1) …
Scanning processes…
Scanning processor microcode…
Scanning linux images…

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
dgxspark@spark-dc77:~$ sudo dpkg -L dgx-spark-fielddiag | grep bin
dpkg-query: package ‘dgx-spark-fielddiag’ is not installed
Use dpkg --contents (= dpkg-deb --contents) to list archive files contents.
dgxspark@spark-dc77:~$ sudo dpkg -l | grep dgx-spark-fieldiag
ii dgx-spark-fieldiag 1.0.9-1 arm64 Nvidia DGX Spark fieldiag tool
dgxspark@spark-dc77:~$ cat /etc/apt/sources.list.d/cuda-compute-repo.sources
cat: /etc/apt/sources.list.d/cuda-compute-repo.sources: No such file or directory
dgxspark@spark-dc77:~$ cat /etc/apt/sources.list.d/cuda-sbsa-ubuntu2404.list
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/ /
dgxspark@spark-dc77:~$

The DGX Spark Mini OS image is required because the systems are currently running stock Ubuntu 24.04, which does not include the DGX‑specific kernel modules, drivers, firmware helpers, and diagnostic utilities that the ConnectX‑7 NIC and QSFP112 ports depend on. The public CUDA and Ubuntu repositories provide only generic packages, and the dgx-spark-fielddiag package available there is a meta‑package that contains no diagnostic binaries. As a result, the actual DGX diagnostic tool is missing, and the system cannot load the DGX‑specific PCIe power‑budget modules or NIC firmware components that are needed to bring the QSFP ports online.

Both DGX Spark Minis were recovered using the same non‑DGX image, so both systems now behave identically: the QSFP cages never power on, the NIC remains limited to 27W PCIe power, and no DGX network interfaces appear. The diagnostic tool requested by NVIDIA support cannot run because the real tool is only included in the DGX Spark Mini factory OS image, not in any public repository.

To proceed with the required diagnostics and to restore the DGX‑specific networking and PCIe power‑management stack, I need access to the official DGX Spark Mini OS recovery image. This will restore the DGX kernel, Mellanox drivers, firmware utilities, and the full diagnostic environment needed to continue troubleshooting the QSFP connectivity issue.

We cannot support any systems that have a different OS installed. You can install the OS by following the recovery steps: System Recovery — DGX Spark User Guide
However, your previous message and command output indicate that the Fieldiag installed successfully; you should see the installation directory at /opt/nvidia/dgx-spark-fieldiag and can run the partnerdiag script as indicated in the documentation.

I have not installed a different OS. I used the NVIDIA image to restore factory defaults, and that is what is not working. I ran:

sudo init 3
cd /opt/nvidia/dgx-spark-fieldiag
sudo ./partnerdiag --field

It just opens the TTY1 screen, shows system info, and does not run the diagnostics.

dgxspark@spark-dc77:~$ sudo dpkg -l | grep dgx-spark-fieldiag
ii dgx-spark-fieldiag 1.0.9-1 arm64 Nvidia DGX Spark fieldiag tool

dgxspark@spark-dc77:~$ sudo dpkg -L dgx-spark-fielddiag | grep bin
dpkg-query: package ‘dgx-spark-fielddiag’ is not installed
Use dpkg --contents (= dpkg-deb --contents) to list archive files contents.

These two are contradicting each other.

That output is expected; you should not see any binaries with the second command you listed.
Do you see this directory? /opt/nvidia/dgx-spark-fieldiag

sudo init 3 brings up the tty console. Did you run the partnerdiag in the tty console?

cd /opt/nvidia/dgx-spark-fieldiag
sudo ./partnerdiag --field

cd /opt/nvidia/dgx-spark-fieldiag results in "No such file or directory"

Package info:

elsaco@spark1:~$ apt info dgx-spark-fieldiag
Package: dgx-spark-fieldiag
Version: 1.0.9-1
Priority: optional
Section: multiverse/devel
Maintainer: cudatools <cudatools@nvidia.com>
Installed-Size: 814 MB
Depends: stress-ng, fio, memtester
Download-Size: 807 MB
APT-Manual-Installed: yes
APT-Sources: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa  Packages
Description: Nvidia DGX Spark fieldiag tool
 NVIDIA Field Diagnostic is a powerful software program that is used to test NVIDIA DGX SPARK and detect hardware failures.
 It is intended for health checks of the setup, and a pre-check for RMA qualification of the overall system.
 The software and materials are governed by the LICENSE [/usr/share/doc/dgx-spark-fieldiag/copyright].

Package contents:

elsaco@spark1:~$ dpkg -L dgx-spark-fieldiag
/.
/opt
/opt/nvidia
/opt/nvidia/dgx-spark-fieldiag
/opt/nvidia/dgx-spark-fieldiag/README.txt
/opt/nvidia/dgx-spark-fieldiag/UserGuide-DGX-SPARK-Fieldiag.pdf
/opt/nvidia/dgx-spark-fieldiag/onediagfield.r9.257.3.tgz
/opt/nvidia/dgx-spark-fieldiag/partnerdiag
/opt/nvidia/dgx-spark-fieldiag/relnotes.txt
/opt/nvidia/dgx-spark-fieldiag/spec_dgx_spark_field_level2.json
/usr
/usr/share
/usr/share/doc
/usr/share/doc/dgx-spark-fieldiag
/usr/share/doc/dgx-spark-fieldiag/changelog.Debian.gz
/usr/share/doc/dgx-spark-fieldiag/copyright

@vamsidhar.r.vurimindi2021 if you don’t have these files, you didn’t install the package!

Also, running stock Ubuntu on the DGX Spark is not recommended. Use the recovery image instead.
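
For anyone following along, a typical run looks like the sketch below, with logs written under the install directory (paths taken from the listing above; flags as in the documentation):

sudo init 3
cd /opt/nvidia/dgx-spark-fieldiag
sudo ./partnerdiag --field
ls logs/        # e.g. cmd.log, unloadnvidiamodule.log, filehash.log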

sudo ./partnerdiag --field

failed to execute; the logs from ./unloadnvidiamodule.log, ./cmd.log, and ./filehash.log are below. I downloaded dgx-spark-recovery-image-1.120.36.tar and used it to restore factory defaults. If you have a different image, I will use it to restore to factory defaults and run the diagnostic tool.

dgxspark@spark-dc77:/opt/nvidia/dgx-spark-fieldiag/logs$ cat ./cmd.log
Command Line: onediagfield.r9.257.3 --keep_disp_enabled --force_product=spark --run_spec=spec_dgx_spark_field_level2.json

dgxspark@spark-dc77:/opt/nvidia/dgx-spark-fieldiag/logs$ cat ./unloadnvidiamodule.log
Checking if udevfix WAR is needed
No need for udevfix WAR
Checking and disabling driver persistence mode
Disabled persistence mode for GPU 0000000F:01:00.0.
All done.
Driver persistence mode disabled successfully using nvidia-smi
Checking if rm_driver_discovery process is running
Keeping display enabled
Driver Unload Loop 0/10:
Service nvidia-docker already inactive/not loaded/not a service
Service docker active. unloading
Service docker stopped successfully.
Service nvidia-* active. unloading
Service nvidia-* stopped successfully.
Service nvidia-persistenced already inactive/not loaded/not a service
Service fabricmanager already inactive/not loaded/not a service
Service nvidia-fabricmanager already inactive/not loaded/not a service
Service dcgm already inactive/not loaded/not a service
Service dcgm-exporter already inactive/not loaded/not a service
Service dcgm-exporter.service already inactive/not loaded/not a service
Service nvidia-dcgm already inactive/not loaded/not a service
Service nvidia-nvsm already inactive/not loaded/not a service
Service gdrdrv already inactive/not loaded/not a service
Service nv_peer_mem already inactive/not loaded/not a service
Service nvsm* already inactive/not loaded/not a service
Service systemd-udevd active. unloading
Service systemd-udevd stopped successfully.
Service systemd-udevd-kernel.socket active. unloading
Service systemd-udevd-kernel.socket stopped successfully.
Service systemd-udevd-control.socket active. unloading
Service systemd-udevd-control.socket stopped successfully.
Checking for module: nouveau
Module: nouveau not loaded
Checking for module: nv_peer_mem
Module: nv_peer_mem not loaded
Checking for module: nv_peermem
Module: nv_peermem not loaded
Checking for module: nvidia_peermem
Module: nvidia_peermem not loaded
Checking for module: gdrdrv
Module: gdrdrv not loaded
Checking for module: nvidia_uvm
Attempting to remove module: nvidia_uvm Attempt 1
Removed module: nvidia_uvm successfully
Checking for module: nvidia_drm
Attempting to remove module: nvidia_drm Attempt 1
Attempting to remove module: nvidia_drm Attempt 2
Attempting to remove module: nvidia_drm Attempt 3
Attempting to remove module: nvidia_drm Attempt 4
Attempting to remove module: nvidia_drm Attempt 5
Unable to remove module: nvidia_drm
Driver Unload Loop 1/10:
Service nvidia-docker already inactive/not loaded/not a service
Service docker already inactive/not loaded/not a service
Service nvidia-* already inactive/not loaded/not a service
Service nvidia-persistenced already inactive/not loaded/not a service
Service fabricmanager already inactive/not loaded/not a service
Service nvidia-fabricmanager already inactive/not loaded/not a service
Service dcgm already inactive/not loaded/not a service
Service dcgm-exporter already inactive/not loaded/not a service
Service dcgm-exporter.service already inactive/not loaded/not a service
Service nvidia-dcgm already inactive/not loaded/not a service
Service nvidia-nvsm already inactive/not loaded/not a service
Service gdrdrv already inactive/not loaded/not a service
Service nv_peer_mem already inactive/not loaded/not a service
Service nvsm* already inactive/not loaded/not a service
Service systemd-udevd already inactive/not loaded/not a service
Service systemd-udevd-kernel.socket already inactive/not loaded/not a service
Service systemd-udevd-control.socket already inactive/not loaded/not a service
Checking for module: nouveau
Module: nouveau not loaded
Checking for module: nv_peer_mem
Module: nv_peer_mem not loaded
Checking for module: nv_peermem
Module: nv_peermem not loaded
Checking for module: nvidia_peermem
Module: nvidia_peermem not loaded
Checking for module: gdrdrv
Module: gdrdrv not loaded
Checking for module: nvidia_uvm
Module: nvidia_uvm not loaded
Checking for module: nvidia_drm
Attempting to remove module: nvidia_drm Attempt 1
Attempting to remove module: nvidia_drm Attempt 2
Attempting to remove module: nvidia_drm Attempt 3
Attempting to remove module: nvidia_drm Attempt 4
Attempting to remove module: nvidia_drm Attempt 5
Unable to remove module: nvidia_drm
Driver Unload Loop 2/10:
Service nvidia-docker already inactive/not loaded/not a service
Service docker already inactive/not loaded/not a service
Service nvidia-* already inactive/not loaded/not a service
Service nvidia-persistenced already inactive/not loaded/not a service
Service fabricmanager already inactive/not loaded/not a service
Service nvidia-fabricmanager already inactive/not loaded/not a service
Service dcgm already inactive/not loaded/not a service
Service dcgm-exporter already inactive/not loaded/not a service
Service dcgm-exporter.service already inactive/not loaded/not a service
Service nvidia-dcgm already inactive/not loaded/not a service
Service nvidia-nvsm already inactive/not loaded/not a service
Service gdrdrv already inactive/not loaded/not a service
Service nv_peer_mem already inactive/not loaded/not a service
Service nvsm* already inactive/not loaded/not a service
Service systemd-udevd already inactive/not loaded/not a service
Service systemd-udevd-kernel.socket already inactive/not loaded/not a service
Service systemd-udevd-control.socket already inactive/not loaded/not a service
Checking for module: nouveau
Module: nouveau not loaded
Checking for module: nv_peer_mem
Module: nv_peer_mem not loaded
Checking for module: nv_peermem
Module: nv_peermem not loaded
Checking for module: nvidia_peermem
Module: nvidia_peermem not loaded
Checking for module: gdrdrv
Module: gdrdrv not loaded
Checking for module: nvidia_uvm
Module: nvidia_uvm not loaded
Checking for module: nvidia_drm
Attempting to remove module: nvidia_drm Attempt 1
Attempting to remove module: nvidia_drm Attempt 2
Attempting to remove module: nvidia_drm Attempt 3
Attempting to remove module: nvidia_drm Attempt 4
Attempting to remove module: nvidia_drm Attempt 5
Unable to remove module: nvidia_drm
Driver Unload Loop 3/10:
Service nvidia-docker already inactive/not loaded/not a service
Service docker already inactive/not loaded/not a service
Service nvidia-* already inactive/not loaded/not a service
Service nvidia-persistenced already inactive/not loaded/not a service
Service fabricmanager already inactive/not loaded/not a service
Service nvidia-fabricmanager already inactive/not loaded/not a service
Service dcgm already inactive/not loaded/not a service
Service dcgm-exporter already inactive/not loaded/not a service
Service dcgm-exporter.service already inactive/not loaded/not a service
Service nvidia-dcgm already inactive/not loaded/not a service
Service nvidia-nvsm already inactive/not loaded/not a service
Service gdrdrv already inactive/not loaded/not a service
Service nv_peer_mem already inactive/not loaded/not a service
Service nvsm* already inactive/not loaded/not a service
Service systemd-udevd already inactive/not loaded/not a service
Service systemd-udevd-kernel.socket already inactive/not loaded/not a service
Service systemd-udevd-control.socket already inactive/not loaded/not a service
Checking for module: nouveau
Module: nouveau not loaded
Checking for module: nv_peer_mem
Module: nv_peer_mem not loaded
Checking for module: nv_peermem
Module: nv_peermem not loaded
Checking for module: nvidia_peermem
Module: nvidia_peermem not loaded
Checking for module: gdrdrv
Module: gdrdrv not loaded
Checking for module: nvidia_uvm
Module: nvidia_uvm not loaded
Checking for module: nvidia_drm
Attempting to remove module: nvidia_drm Attempt 1
Attempting to remove module: nvidia_drm Attempt 2
Attempting to remove module: nvidia_drm Attempt 3
Attempting to remove module: nvidia_drm Attempt 4
Attempting to remove module: nvidia_drm Attempt 5
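
If useful, I can also check what is still holding nvidia_drm on this system before the next attempt, e.g.:

lsmod | grep nvidia_drm
sudo lsof /dev/dri/* 2>/dev/null
sudo fuser -v /dev/nvidia* 2>/dev/null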

dgxspark@spark-dc77:/opt/nvidia/dgx-spark-fieldiag/logs$ cat ./filehash.log
tests/scripts/core/dgx_common.py: OK
tests/scripts/core/dgx_curl.py: OK
tests/scripts/core/dgx_eud_common.py: OK
tests/scripts/core/dgx_mods_errors.py: OK
tests/scripts/core/dgx_test_status_logger.py: OK
tests/scripts/core/dgx_gpu_common.py: OK
tests/scripts/core/dgx_fielddiag_common.py: OK
tests/scripts/core/dgx_errorcodes.py: OK
tests/scripts/core/init.py: OK
tests/scripts/core/dgx_mods_error_code_processor.py: OK
tests/scripts/core/dgx_dcdiagtest.py: OK
tests/scripts/core/dgx_osm.py: OK
tests/scripts/core/dgx_nbu_common.py: OK
tests/scripts/core/dgx_texr.py: OK
tests/scripts/core/dgx_banner.py: OK
tests/scripts/core/dgx_types.py: OK
tests/scripts/core/dgx_test_category.py: OK
tests/scripts/core/dgx_product.py: OK
tests/scripts/core/dgx_mods_tests.py: OK
tests/scripts/core/dgx_hasbug.py: OK
tests/scripts/core/multinode/dgx_argparse.py: OK
tests/scripts/core/multinode/dgx_cluster.py: OK
tests/scripts/core/multinode/dgx_gdm_config.py: OK
tests/scripts/core/multinode/dgx_switchtelemetry.py: OK
tests/scripts/core/multinode/gdm_pb2.py: OK
tests/scripts/core/multinode/init.py: OK
tests/scripts/core/multinode/dgx_gdm_client.py: OK
tests/scripts/core/multinode/dgx_heartbeat.py: OK
tests/scripts/core/multinode/dgx_node.py: OK
tests/scripts/core/dgx_error_code_processor.py: OK
tests/scripts/core/dra/init.py: OK
tests/scripts/core/dra/dgx_dra.py: OK
tests/scripts/core/dra/messages_pb2.py: OK
tests/scripts/core/bglogger/dgx_bgmonitor.py: OK
tests/scripts/core/bglogger/dgx_bglogger_mle.py: OK
tests/scripts/core/bglogger/monitors/dgx_bgmonitor_example.py: OK
tests/scripts/core/bglogger/monitors/dgx_bgmon_cx7.py: OK
tests/scripts/core/bglogger/monitors/dgx_bgmon_i2c.py: OK
tests/scripts/core/bglogger/dgx_bglogger.py: OK
tests/scripts/core/dgx_unified_json.py: OK
tests/scripts/core/dgx_logging.py: OK
tests/scripts/core/dgx_tracer.py: OK
tests/scripts/utilities/dgx_mse_utility.py: OK
tests/scripts/utilities/dgx_dra_utility.py: OK
tests/scripts/utilities/dgx_log_analysis.py: OK
tests/scripts/utilities/dgx_dimm_utility.py: OK
tests/scripts/utilities/bmc/dgx_bmc_redfish.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_dgx2_bb.py: OK
tests/scripts/utilities/bmc/dgx_bmc.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_dgx2.py: OK
tests/scripts/utilities/bmc/init.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_deltanext.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_hgx2testernext.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_vulcannextb300a.py: OK
tests/scripts/utilities/bmc/dgx_bmcutil_datalogger.py: OK
tests/scripts/utilities/bmc/dgx_bmcutil_main.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_vulcannext.py: OK
tests/scripts/utilities/bmc/dgx_fan_utility.py: OK
tests/scripts/utilities/bmc/dgx_bmcutil_utilities.py: OK
tests/scripts/utilities/bmc/dgx_bmc_sensor_thresholds.py: OK
tests/scripts/utilities/bmc/dgx_bmc_oem_luna.py: OK
tests/scripts/utilities/bmc/dgx_bmcutil_mlelogger.py: OK
tests/scripts/utilities/bmc/dgx_bmcutil_exceptions.py: OK
tests/scripts/utilities/dgx_tegra_utility.py: OK
tests/scripts/utilities/dgx_retimer_utility.py: OK
tests/scripts/utilities/dgx_mla_utility.py: OK
tests/scripts/utilities/dgx_nvue_cli.py: OK
tests/scripts/utilities/dgx_chipsdb.py: OK
tests/scripts/utilities/dgx_decorators.py: OK
tests/scripts/utilities/init.py: OK
tests/scripts/utilities/dgx_bg_logger.py: OK
tests/scripts/utilities/dgx_ib_common.py: OK
tests/scripts/utilities/dgx_chkoccurrences_testargs_schema.py: OK
tests/scripts/utilities/dgx_chkoccurrences_utility.py: OK
tests/scripts/utilities/dgx_diag_sot_utility.py: OK
tests/scripts/utilities/dgx_os_utility.py: OK
tests/scripts/utilities/dgx_libevent_utility.py: OK
tests/scripts/utilities/dgx_repair_utility.py: OK
tests/scripts/utilities/dgx_fpgareg_utility.py: OK
tests/scripts/utilities/dgx_utility.py: OK
tests/scripts/utilities/dgx_skucheck_utilities.py: OK
tests/scripts/utilities/dgx_inforom_utility.py: OK
tests/scripts/utilities/dgx_tdm_utility.py: OK
tests/scripts/utilities/dgx_bg_logger_gpu_retimer.py: OK
tests/scripts/utilities/dgx_testspec_validation.py: OK
tests/scripts/utilities/dgx_paramiko_utility.py: OK
tests/scripts/utilities/io/dgx_pcie_utility.py: OK
tests/scripts/utilities/io/init.py: OK
tests/scripts/utilities/io/dgx_nvswitch_common.py: OK
tests/scripts/utilities/io/dgx_nvlink_utility.py: OK
tests/scripts/utilities/io/dgx_nvl_ec_types.py: OK
tests/scripts/utilities/io/dgx_nvl_ec_processor.py: OK
tests/scripts/systemhal/dgx_system_nero.py: OK
tests/scripts/systemhal/dgx_system_axion.py: OK
tests/scripts/systemhal/dgx_system_lego_c2.py: OK
tests/scripts/systemhal/dgx_system_hgx2_tester_next.py: OK
tests/scripts/systemhal/dgx_system_titania_gb110.py: OK
tests/scripts/systemhal/dgx_system_bianca_gb110.py: OK
tests/scripts/systemhal/dgx_system_nero_bw.py: OK
tests/scripts/systemhal/dgx_system_dgx.py: OK
tests/scripts/systemhal/dgx_system_cgx.py: OK
tests/scripts/systemhal/dgx_system_redstone_next.py: OK
tests/scripts/systemhal/dgx_system_falcon.py: OK
tests/scripts/systemhal/dgx_system_bladerunner2.py: OK
tests/scripts/systemhal/dgx_system_zhora.py: OK
tests/scripts/systemhal/dgx_system_bianca.py: OK
tests/scripts/systemhal/dgx_system_keystone.py: OK
tests/scripts/systemhal/dgx_system_freysa.py: OK
tests/scripts/systemhal/dgx_system_bladerunner.py: OK
tests/scripts/systemhal/dgx_system_dgxa100_next_dp.py: OK
tests/scripts/systemhal/dgx_system_cordelia.py: OK
tests/scripts/systemhal/dgx_system_alon.py: OK
tests/scripts/systemhal/dgx_system_oberon_gb.py: OK
tests/scripts/systemhal/dgx_system.py: OK
tests/scripts/systemhal/init.py: OK
tests/scripts/systemhal/dgx_system_cordelia_gb.py: OK
tests/scripts/systemhal/dgx_system_common.py: OK
tests/scripts/systemhal/dgx_system_zabawa.py: OK
tests/scripts/systemhal/dgx_system_dgxa100_next.py: OK
tests/scripts/systemhal/dgx_system_hgx2.py: OK
tests/scripts/systemhal/dgx_system_gora.py: OK
tests/scripts/systemhal/dgx_system_sgxa100.py: OK
tests/scripts/systemhal/dgx_system_mab.py: OK
tests/scripts/systemhal/dgx_system_vulcan_next.py: OK
tests/scripts/systemhal/dgx_system_spectrum.py: OK
tests/scripts/systemhal/dgx_system_dgxh100_next.py: OK
tests/scripts/systemhal/dgx_system_benthos.py: OK
tests/scripts/systemhal/dgx_system_starship.py: OK
tests/scripts/systemhal/dgx_system_delta_next.py: OK
tests/scripts/systemhal/dgx_system_ranger.py: OK
tests/scripts/systemhal/dgx_system_custom_command.py: OK
tests/scripts/systemhal/dgx_system_juliet_ferdinand.py: OK
tests/scripts/systemhal/dgx_system_merlin.py: OK
tests/scripts/systemhal/dgx_system_mercury.py: OK
tests/scripts/systemhal/dgx_system_aja.py: OK
tests/scripts/systemhal/dgx_system_planck.py: OK
tests/scripts/systemhal/dgx_system_bladerunner3.py: OK
tests/scripts/systemhal/dgx_system_goldstone_next.py: OK
tests/scripts/systemhal/dgx_system_dgxh100_next_b300.py: OK
tests/scripts/systemhal/dgx_system_ut2_1.py: OK
tests/scripts/systemhal/dgx_system_vulcan_next_b300a.py: OK
tests/scripts/systemhal/dgx_system_oberon.py: OK
tests/scripts/systemhal/dgx_system_royb.py: OK
tests/scripts/systemhal/dgx_system_skinny_joe.py: OK
tests/scripts/systemhal/dgx_system_generic.py: OK
tests/scripts/systemhal/dgx_system_oberon_gb110.py: OK
tests/scripts/systemhal/dgx_system_lunadp.py: OK
tests/scripts/systemhal/dgx_system_galaxy.py: OK
tests/scripts/systemhal/dgx_system_titania_gb.py: OK
tests/scripts/dgx_run.py: OK
tests/scripts/tests/dgx_repair.py: OK
tests/scripts/tests/dgx_skucheck_datalogger.py: OK
tests/scripts/tests/dgx_cpustress.py: OK
tests/scripts/tests/dgx_inventory.py: OK
tests/scripts/tests/dgx_inforom.py: OK
tests/scripts/tests/dgx_bglogger_control.py: OK
tests/scripts/tests/dgx_custom.py: OK
tests/scripts/tests/dgx_thermal.py: OK
tests/scripts/tests/dgx_skucheck_exceptions.py: OK
tests/scripts/tests/bmc/dgx_fan.py: OK
tests/scripts/tests/bmc/init.py: OK
tests/scripts/tests/dgx_skucheck_functionhandlers.py: OK
tests/scripts/tests/init.py: OK
tests/scripts/tests/dgx_powersync.py: OK
tests/scripts/tests/dgx_nvosphyslinkcheck.py: OK
tests/scripts/tests/dgx_power.py: OK
tests/scripts/tests/dgx_dimmecc.py: OK
tests/scripts/tests/dgx_ssd.py: OK
tests/scripts/tests/dgx_nvdriver.py: OK
tests/scripts/tests/dgx_checkinforom.py: OK
tests/scripts/tests/dgx_gpustress.py: OK
tests/scripts/tests/dgx_gpuperfswitch.py: OK
tests/scripts/tests/dgx_nbusensorcheck.py: OK
tests/scripts/tests/dgx_telemetry.py: OK
tests/scripts/tests/dgx_skucheck.py: OK
tests/scripts/tests/dgx_perf.py: OK
tests/scripts/tests/dgx_chkoccurrences.py: OK
tests/scripts/tests/dgx_skucheck_main.py: OK
tests/scripts/tests/dgx_cudacores.py: OK
tests/scripts/tests/dgx_nvoshealthcheck.py: OK
tests/scripts/tests/dgx_gpumemeccstress.py: OK
tests/scripts/tests/nbu/init.py: OK
tests/scripts/tests/nbu/cx8/init.py: OK
tests/scripts/tests/nbu/dgx_dpudiag.py: OK
tests/scripts/tests/nbu/ib/init.py: OK
tests/scripts/tests/dgx_gpumem.py: OK
tests/scripts/tests/dgx_ssdhealthcheck.py: OK
tests/scripts/tests/dgx_custommods.py: OK
tests/scripts/tests/dgx_thermalres.py: OK
tests/scripts/tests/dgx_video.py: OK
tests/scripts/tests/dgx_gpu_fieldiag.py: OK
tests/scripts/tests/dgx_ist.py: OK
tests/scripts/tests/io/dgx_ibstressmad.py: OK
tests/scripts/tests/io/dgx_connectivity.py: OK
tests/scripts/tests/io/dgx_theta.py: OK
tests/scripts/tests/io/dgx_cxeyegrade.py: OK
tests/scripts/tests/io/dgx_cable_cartridge.py: OK
tests/scripts/tests/io/init.py: OK
tests/scripts/tests/io/dgx_nvswitch.py: OK
tests/scripts/tests/io/dgx_ibmodechange.py: OK
tests/scripts/tests/io/dgx_mopir.py: OK
tests/scripts/tests/io/dgx_nvlink.py: OK
tests/scripts/tests/io/dgx_c2c.py: OK
tests/scripts/tests/io/dgx_aer.py: OK
tests/scripts/tests/io/dgx_network_enum.py: OK
tests/scripts/tests/io/dgx_pcieproperties.py: OK
tests/scripts/tests/io/dgx_ibstress.py: OK
tests/scripts/tests/io/dgx_pcie.py: OK
tests/scripts/system_mle/init.py: OK
tests/scripts/system_mle/mle.py: OK
tests/driver/mods_krnl.c: OK
tests/driver/COPYING: OK
tests/driver/mods.h: OK
tests/driver/mods_mem.c: OK
tests/driver/Makefile: OK
tests/driver/mods_debugfs.c: OK
tests/driver/mods_irq.c: OK
tests/driver/mods_internal.h: OK
tests/driver/hash: OK
tests/driver/mods_acpi.c: OK
tests/driver/mods_config.h: OK
tests/driver/mods_pci.c: OK
tests/driver/README: OK
tests/driver/mods_arm_ffa.c: OK
tests/driver/mods_bpmpipc.c: OK
tests/mods.580/gdm: OK
tests/mods.580/cvb_ist_gb110_0.yme: OK
tests/mods.580/nvdec.bin: OK
tests/mods.580/pg520sku280_nvspec.jsone: OK
tests/mods.580/centralized_thresholds.jsone: OK
tests/mods.580/gdm_config.json: OK
tests/mods.580/pg520sku205_nvspec.jsone: OK
tests/mods.580/version_checker: OK
tests/mods.580/libnvgpucomp.so: OK
tests/mods.580/fieldiag: OK
tests/mods.580/nvjpg.bin: OK
tests/mods.580/driver.tgz: OK
tests/mods.580/pg530sku215_nvspec.jsone: OK
tests/mods.580/pg520sku238_nvspec.jsone: OK
tests/mods.580/p1010sku210_nvspec.jsone: OK
tests/mods.580/repair.spe: OK
tests/mods.580/pg520sku266_nvspecs.jsone: OK
tests/mods.580/install_module.sh: OK
tests/mods.580/cask_sm103.bin: OK
tests/mods.580/pg520sku235_nvspec.jsone: OK
tests/mods.580/pg520sku292_nvspec.jsone: OK
tests/mods.580/libmultinode_transport.so: OK
tests/mods.580/pg520sku237_nvspec.jsone: OK
tests/mods.580/gb1xx_ist_fs_encoding.jsone: OK
tests/mods.580/pg520sku200_nvspec.jsone: OK
tests/mods.580/default.bin: OK
tests/mods.580/multifieldiag.sh: OK
tests/mods.580/pg520sku282_nvspec.jsone: OK
tests/mods.580/libstdc++.so.6: OK
tests/mods.580/check_config.spe: OK
tests/mods.580/pg520sku202_nvspec.jsone: OK
tests/mods.580/nvlinktopofiles.bin: OK
tests/mods.580/cuda.bin: OK
tests/mods.580/libssl.so.3: OK
tests/mods.580/nvlinktopofiles.onediag.bin: OK
tests/mods.580/check_config.sh: OK
tests/mods.580/pg530sku200_nvspec.jsone: OK
tests/mods.580/gpu_diag.spe: OK
tests/mods.580/pg520sku213_nvspec.jsone: OK
tests/mods.580/cask_sm121.bin: OK
tests/mods.580/p1010sku215_nvspec.jsone: OK
tests/mods.580/libcrypto.so.3: OK
tests/mods.580/p1010sku205_nvspec.jsone: OK
tests/mods.580/p1010sku200_nvspec.jsone: OK
tests/mods.580/pg520sku236_nvspec.jsone: OK
tests/mods.580/nvofa.bin: OK
tests/mods.580/librtcore.so: OK
tests/mods.580/cask_sm100.bin: OK
tests/mods.580/pg530sku206_nvspec.jsone: OK
tests/mods.580/pg133sku266_nvspec.jsone: OK
tests/mods.580/pg520sku207_nvspec.jsone: OK
tests/mods.580/cask_sm120.bin: OK
tests/mods.580/pg530sku203_nvspec.jsone: OK
tests/mods.580/gpu_fd.spe: OK
tests/mods.580/ist.spe: OK
tests/mods.580/libspirv.so: OK
tests/mods.580/dgxfielddiag.spe: OK
tests/mods.580/pg530sku204_nvspec.jsone: OK
tests/mods.580/pg520sku221_nvspec.jsone: OK
tests/mods.580/pg520sku228_nvspec.jsone: OK

I have the following: 
elsaco@spark1:~$ apt info dgx-spark-fieldiag
Two versions are available:
Version: 1.0.7-1
Version: 1.0.8-1


dgxspark@spark-dc77:/$ dpkg -L dgx-spark-fieldiag
/.
/opt
/opt/nvidia
/opt/nvidia/dgx-spark-fieldiag
/opt/nvidia/dgx-spark-fieldiag/README.txt
/opt/nvidia/dgx-spark-fieldiag/UserGuide-DGX-SPARK-Fieldiag.pdf
/opt/nvidia/dgx-spark-fieldiag/onediagfield.r9.257.3.tgz
/opt/nvidia/dgx-spark-fieldiag/partnerdiag
/opt/nvidia/dgx-spark-fieldiag/relnotes.txt
/opt/nvidia/dgx-spark-fieldiag/spec_dgx_spark_field_level2.json
/usr
/usr/share
/usr/share/doc
/usr/share/doc/dgx-spark-fieldiag
/usr/share/doc/dgx-spark-fieldiag/changelog.Debian.gz
/usr/share/doc/dgx-spark-fieldiag/copyright

To go back to your original issue with the CX7 ports, this message is a non-issue and does not represent an error.
Detected insufficient power on the PCIe slot (27W)
See this post: PCIe power related and PCIe AER errors

After performing the reimage, can you plug in the QSFP cable and run ibdev2netdev to see if any of the ports report an 'Up' state?

I have not yet read the post [PCIe power related and PCIe AER errors], but I ran the ibdev2netdev command and here is the output. I will read the post tonight and get back to you tomorrow.

dgxspark@spark-dc77:/$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

It looks like the CX7 is recognized, and you can configure your network for multinode workloads. Please follow the playbook to set it up and test it.
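
For a basic point-to-point setup between the two Sparks, a fragment along these lines can be applied on each node, assuming netplan is the renderer (file name, address, and interface are illustrative and not the exact playbook YAML):

sudo tee /etc/netplan/99-cx7.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses: [192.168.100.1/24]
EOF
sudo netplan apply

The second node would use a different host address in the same /24 (for example 192.168.100.2), after which iperf3 can be run between the two addresses.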

Hello NVIDIA Support Team,

Subject: DGX Spark Mini – QSFP112 Ports Reporting 200 Gb/s but YAML Network Config Uses 25 GbE (Need Guidance)

I am configuring two DGX Spark Mini systems following the official playbook and YAML network configuration. I have encountered a discrepancy between the physical NIC capabilities and the network behavior, and I would like your guidance on the correct configuration for the QSFP112 fabric.

Summary of the issue

Both Spark systems expose the following interfaces:

enp7s7 – 2.5 GbE switch port (Speed: 2500 Mb/s)

enp1s0f0np0 – ConnectX‑7 port (Speed: 200000 Mb/s)

enp1s0f1np1 – CX7 lane (DOWN)

enP2p1s0f0np0 – ConnectX‑7 port (Speed: 200000 Mb/s)

enP2p1s0f1np1 – CX7 lane (DOWN)

This matches the hardware layout on both Spark1 and Spark2.

Behavior with the NVIDIA‑provided YAML

The YAML from the playbook assigns IPs in the 192.168.100.x range to:

enp1s0f0np0

enP2p1s0f0np0

When I run iperf3 between these IPs, I consistently see:

~36 Gb/s total throughput

High retransmissions

Behavior consistent with a 25 GbE network, not 200 Gb/s QSFP112

Evidence

ethtool reports Speed: 200000 Mb/s on both QSFP interfaces.

iperf3 over 192.168.100.x never exceeds ~36 Gb/s.

Both Spark nodes show identical interface names and speeds.

The QSFP112 DAC cable is connected to the first QSFP port on both systems.
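
If it helps, I can rerun with parallel streams to rule out a single-stream TCP limit, along these lines (addresses as assigned by the YAML; flags illustrative):

iperf3 -s                            # on the first Spark
iperf3 -c 192.168.100.1 -P 8 -t 30   # on the second Spark, 8 parallel streams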

My question

Could you please confirm:

Whether the playbook’s YAML is intended to configure the 25 GbE network only, and not the QSFP112 fabric?

What is the correct method to assign IPs to the 200 Gb/s QSFP112 ports for fabric testing?

Whether additional DGX‑specific drivers, firmware, or configuration steps are required to enable full 200 Gb/s performance on the QSFP ports?

I want to ensure I am following the official DGX Spark Mini configuration path and not deviating from the supported setup.

Thank you for your guidance.

Best regards,

Vamsidhar