GH100: deviceQuery fails with "cudaGetDeviceCount returned 802"

nvidia-bug-report.log.gz (1008.3 KB)

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL
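
For reference, code 802 corresponds to cudaErrorSystemNotReady in the CUDA runtime's cudaError_t enum. Below is a minimal Python sketch for pulling that code out of deviceQuery text like the output above. The function name parse_device_count_error and the small lookup table are illustrative assumptions, not part of any NVIDIA tool, and the table is deliberately incomplete.

```python
import re

# Small, deliberately incomplete subset of cudaError_t values (illustrative only).
CUDA_ERROR_NAMES = {
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
}


def parse_device_count_error(output):
    """Return (code, name) from deviceQuery text, or None if no error line is found."""
    match = re.search(r"cudaGetDeviceCount returned (\d+)", output)
    if match is None:
        return None
    code = int(match.group(1))
    # Codes missing from the table are reported as "unknown" rather than guessed.
    return code, CUDA_ERROR_NAMES.get(code, "unknown")


sample = """cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL"""

print(parse_device_count_error(sample))
```

A check like this only classifies the failure. It says nothing about why the system is not ready, which is what the rest of the thread looks at.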

nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Sun Mar 3 20:10:45 2024
Driver Version : 535.161.07
CUDA Version : 12.2

Attached GPUs : 1
GPU 00000000:00:09.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1653923045109
GPU UUID : GPU-027d1d58-d8e4-8dfd-7051-fdc747cf5a00
Minor Number : 0
VBIOS Version : 96.00.74.00.06
MultiGPU Board : No
Board ID : 0x9
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 4
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.161.07
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x09
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:00:09.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 742 KB/s
Rx Throughput : 671 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : Disabled
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 0 MiB
Free : 81007 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 29 C
GPU T.Limit Temp : 57 C
GPU Shutdown T.Limit Temp : -8 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 36 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 70.80 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 725.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
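
The Fabric block near the end of this log is probably the part most relevant to error 802: "State : In Progress" suggests the NVLink fabric has not finished initializing. Below is a minimal sketch that extracts that value, assuming the flattened "Key : Value" layout shown above. The helper name fabric_state is hypothetical.

```python
def fabric_state(smi_text):
    """Return the 'State' value listed under the 'Fabric' heading, or None."""
    in_fabric = False
    for raw in smi_text.splitlines():
        line = raw.strip()
        if line == "Fabric":
            in_fabric = True
            continue
        if in_fabric:
            key, sep, value = line.partition(":")
            if not sep:
                # A line without ':' is treated as the next heading: no State found.
                return None
            if key.strip() == "State":
                return value.strip()
    return None


sample = """Voltage
Graphics : 725.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None"""

print(fabric_state(sample))
```

On a correctly configured system the state would be expected to move past "In Progress" once the fabric manager has finished its work. Here it cannot, because the service exits at startup.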

Mar 03 20:13:27 VM_1_2_centos systemd[1]: Starting NVIDIA fabric manager service...
Mar 03 20:13:27 VM_1_2_centos nv-fabricmanager[4566]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOT>
Mar 03 20:13:27 VM_1_2_centos systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited status=1
Mar 03 20:13:27 VM_1_2_centos systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Mar 03 20:13:27 VM_1_2_centos systemd[1]: Failed to start NVIDIA fabric manager service.

/proc/driver/nvidia-nvswitch/devices is empty

The machine is not set up properly; the fabric manager is being loaded incorrectly.

The GPU you have is an SXM GPU, part of an HGX baseboard that has multiple GPUs. Someone has provided you with an instance that has only a single GPU, but the machine was not set up properly before that. I won’t be able to offer suggestions to fix this. You’ll need to discuss the issue with the owner of the machine.