Failed to run topology matrix (nvidia-smi topo -m)

Hi I am having problem to get the GPU topology using the command nvidia-smi topo -m
I got the following error:
GPU0 GPU1 mlx5_0 CPU Affinity
GPU0 X SOC SOC 48-95
Failed to run topology matrix

I tried to do a gpu-reset in the GPU1 but I got the following output:
Unable to reset this GPU because it’s being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.
Terminating early due to previous errors.

But there is nothing running in the GPU, as we can see as following:
±-----------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------±---------------------±---------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m On | 0002:01:00.0 Off | 0 |
| N/A 29C P8 20W / 235W | 22MiB / 11519MiB | 0% Default |
±------------------------------±---------------------±---------------------+
| 1 Tesla K40m On | 0006:01:00.0 Off | 0 |
| N/A 26C P8 19W / 235W | 22MiB / 11519MiB | 0% Default |
±------------------------------±---------------------±---------------------+

±----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
±----------------------------------------------------------------------------+

For further information, I am also including the output of the command nvidia-smi -a
Driver Version : 352.39

Attached GPUs : 2
GPU 0002:01:00.0
Product Name : Tesla K40m
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : -
GPU UUID : -
Minor Number : 0
VBIOS Version : 80.80.24.00.06
MultiGPU Board : No
Board ID : 0x20100
Inforom Version
Image Version : 2081.0202.01.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0002
Device Id : 0x102310DE
Bus Id : 0002:01:00.0
Sub System Id : 0x097E10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11519 MiB
Used : 22 MiB
Free : 11497 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 29 C
GPU Shutdown Temp : 95 C
GPU Slowdown Temp : 90 C
Power Readings
Power Management : Supported
Power Draw : 20.68 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Enforced Power Limit : 235.00 W
Min Power Limit : 150.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 875 MHz
Memory : 3004 MHz
Default Applications Clocks
Graphics : 745 MHz
Memory : 3004 MHz
Max Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 3004 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 0006:01:00.0
Product Name : Tesla K40m
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : -
GPU UUID : -
Minor Number : 1
VBIOS Version : 80.80.3E.00.01
MultiGPU Board : No
Board ID : 0x60100
Inforom Version
Image Version : 2081.0202.01.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0006
Device Id : 0x102310DE
Bus Id : 0006:01:00.0
Sub System Id : 0x097E10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 1
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11519 MiB
Used : 22 MiB
Free : 11497 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : 95 C
GPU Slowdown Temp : 90 C
Power Readings
Power Management : Supported
Power Draw : 20.00 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Enforced Power Limit : 235.00 W
Min Power Limit : 180.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 875 MHz
Memory : 3004 MHz
Default Applications Clocks
Graphics : 745 MHz
Memory : 3004 MHz
Max Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 3004 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

Try CUDA 8 or a newer driver.

The topo command is done on a best efforts basis. If there is something it can’t determine about the system you are running on, it may not be able to produce results. trying to do a GPU reset isn’t likely to fix anything.

I see, I will try that. Thank you txbob!