nvidia-smi only responds when nvidia.NVreg_EnableGpuFirmware=0 is set on the kernel command line in GRUB.
When I remove it, nvidia-smi reports that no device was found.
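For readers who want to reproduce the toggle: the parameter lives on the kernel command line. The sketch below is only an illustration. It assumes a Debian/Ubuntu-style setup where the line is stored in /etc/default/grub as GRUB_CMDLINE_LINUX_DEFAULT and is applied with `sudo update-grub` followed by a reboot. The helper name `set_kernel_param` is hypothetical; it edits the line as a string and does not touch any file.

```python
# Parameter quoted from the post above.
PARAM = "nvidia.NVreg_EnableGpuFirmware=0"


def set_kernel_param(grub_line: str, param: str, enabled: bool) -> str:
    """Return grub_line with `param` present (enabled) or absent (disabled).

    grub_line is expected to look like: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
    """
    key, _, value = grub_line.partition("=")
    # Drop the surrounding quotes, then split into individual arguments.
    args = value.strip().strip("\"'").split()
    # Remove any existing copy so the parameter never appears twice.
    args = [a for a in args if a != param]
    if enabled:
        args.append(param)
    joined = " ".join(args)
    return f'{key}="{joined}"'


if __name__ == "__main__":
    line = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'
    print(set_kernel_param(line, PARAM, enabled=True))
```

After rebooting, `cat /proc/cmdline` shows whether the parameter is actually active.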
aipath@aipath-TRX50-AERO-D:~$ dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.3.7 |
| Driver Version Detected | 560.28.03 |
| GPU Device IDs Detected | 2331 |
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Info | Persistence mode for GPU 0 is disabled. Enabl |
| | e persistence mode by running "nvidia-smi -i |
| | -pm 1 " as root. |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Fail |
| Error | GPU 0 had uncorrectable memory errors and row |
| | remapping failed. Run a field diagnostic on |
| | the GPU. |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Integration -------+------------------------------------------------+
| PCIe | Fail - All |
| Warning | GPU 0 Error using CUDA API cudaDeviceGetByPCI |
| | BusId Check DCGM and system logs for errors. |
| | Reset GPU. Restart DCGM. Rerun diagnostics. ’ |
| | initialization error’ for GPU 0, bus ID = 000 |
| | 00000:81:00.0 |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Fail - All |
| Warning | GPU 0 Error using CUDA API cuInit Check DCGM |
| | and system logs for errors. Reset GPU. Restar |
| | t DCGM. Rerun diagnostics. Unable to initiali |
| | ze CUDA library: ‘initialization error’. ; ve |
| | rify that the fabric-manager has been started |
| | if applicable. Please check if a CUDA sample |
| | program can be run successfully on this host |
| | . Refer to https://github.com/nvidia/cuda-sam |
| | ples |
| Diagnostic | Fail - All |
| Warning | GPU 0 API call cudaDeviceGetByPCIBusId failed |
| | : ‘initialization error’ Check DCGM and syste |
| | m logs for errors. Reset GPU. Restart DCGM. R |
| | erun diagnostics. |
| Warning | GPU 0 There was an internal error during the |
| | test: ‘Failed to initialize the plugin.’ Chec |
| | k DCGM and system logs for errors. Reset GPU. |
| | Restart DCGM. Rerun diagnostics. |
| Warning | GPU 0 Error using CUDA API cudaDeviceGetByPCI |
| | BusId Check DCGM and system logs for errors. |
| | Reset GPU. Restart DCGM. Rerun diagnostics. ’ |
| | initialization error’ for GPU 0, bus ID = 000 |
| | 00000:81:00.0 |
+----- Stress ------------+------------------------------------------------+
| Memory Bandwidth | Fail - All |
| Warning | GPU 0 API call cuInit failed: ‘initialization |
| | error; verify that the fabric-manager has be |
| | en started if applicable. Please check if a C |
| | UDA sample program can be run successfully on |
| | this host. Refer to https://github.com/nvidi |
| | a/cuda-samples’ Check DCGM and system logs fo |
| | r errors. Reset GPU. Restart DCGM. Rerun diag |
| | nostics. Please check if a CUDA sample progra |
| | m can be run successfully on this host. Refer |
| | to https://github.com/nvidia/cuda-samples |
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+
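The table above is long, so it can help to pull out only the test verdicts. This is a rough sketch that assumes the two-column `| name | result |` layout shown above; `summarize_diag` and `failed_tests` are hypothetical helper names, not part of DCGM.

```python
def summarize_diag(text: str) -> dict[str, str]:
    """Map each test name in `dcgmi diag` output to its Pass/Fail/Skip result."""
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # border or section line, not a two-column row
        name, value = cells
        # Keep only rows whose result starts with a verdict word; this skips
        # metadata rows, Info/Warning/Error messages and wrapped continuation lines.
        if name and value.split(" ")[0] in {"Pass", "Fail", "Skip"}:
            results[name] = value
    return results


def failed_tests(results: dict[str, str]) -> list[str]:
    """Return the names of tests whose result starts with 'Fail'."""
    return [name for name, value in results.items() if value.startswith("Fail")]
```

For the output above, `failed_tests` would list Page Retirement/Row Remap, PCIe, GPU Memory, Diagnostic and Memory Bandwidth.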
I want to know if the H100 has a hardware issue or if the problem is related to the firmware of the card. I remain at your disposal for any further clarification.
nvidia-bug-report.zip (620.9 KB)
Full report attached. You can also see the nvidia-smi response:
aipath@aipath-TRX50-AERO-D:~$ nvidia-smi
Sat Aug 24 09:17:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:81:00.0 Off | 232 |
| N/A 36C P0 82W / 350W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
When I remove nvidia.NVreg_EnableGpuFirmware=0 from GRUB:
aipath@aipath-TRX50-AERO-D:~$ sudo dmesg | grep -i nvidia
[sudo] password for aipath:
[ 6.827324] nvidia: loading out-of-tree module taints kernel.
[ 6.827331] nvidia: module license ‘NVIDIA’ taints kernel.
[ 6.827335] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 6.827337] nvidia: module license taints kernel.
[ 6.895998] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 6.897383] nvidia 0000:81:00.0: enabling device (0000 -> 0002)
[ 6.907352] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 560.35.03 Fri Aug 16 21:39:15 UTC 2024
[ 6.916828] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 560.35.03 Fri Aug 16 21:21:48 UTC 2024
[ 7.873720] [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
[ 11.457859] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 12.421647] nvidia-uvm: Loaded the UVM driver, major device number 508.
[ 15.391777] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00008100] Failed to allocate NvKmsKapiDevice
[ 15.391891] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00008100] Failed to register device
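To separate real failures from the routine module-loading messages, the log can be filtered for lines tagged ERROR. A minimal sketch, assuming the `[ timestamp] message` format shown above; `dmesg_errors` is a hypothetical helper name.

```python
import re

# Matches "[   15.391777] message" and captures the timestamp and the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def dmesg_errors(text: str) -> list[tuple[float, str]]:
    """Return (seconds_since_boot, message) for every dmesg line tagged ERROR."""
    errors = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and "ERROR" in m.group(2):
            errors.append((float(m.group(1)), m.group(2)))
    return errors
```

Applied to the log above, this keeps only the two nvidia-drm lines at 15.39 s (Failed to allocate NvKmsKapiDevice and Failed to register device).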
Now that I've disabled the GPU firmware and switched from X11 to Wayland, the CUDA sample tests pass. I don't understand why this works. Could someone please help me understand it?
nvidia-bug-report.zip (1.2 MB)
Let me know if you need any further assistance!
aipath@aipath-TRX50-AERO-D:~/cuda-samples/Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting…
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: “NVIDIA H100 PCIe”
CUDA Driver Version / Runtime Version 12.5 / 12.6
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81344 MBytes (85294972928 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1755 MHz (1.75 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 233472 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.5, CUDA Runtime Version = 12.6, NumDevs = 1
Result = PASS
aipath@aipath-TRX50-AERO-D:~/cuda-samples/Samples/1_Utilities/deviceQuery$ cd /home/aipath
aipath@aipath-TRX50-AERO-D:~$ nvidia-smi
Sun Sep 1 11:35:26 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02 Driver Version: 555.58.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:81:00.0 Off | 52 |
| N/A 36C P0 52W / 350W | 9MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2651 G /usr/bin/gnome-shell 5MiB |
+-----------------------------------------------------------------------------------------+
Can anyone tell me whether we can handle this issue ourselves, or whether the H100 needs to be sent back to the vendor for repair?
aipath@aipath-TRX50-AERO-D:~$ nvidia-smi -q -d ROW_REMAPPER
==============NVSMI LOG==============
Timestamp : Wed Sep 4 14:49:02 2024
Driver Version : 555.58.02
CUDA Version : 12.5
Attached GPUs : 1
GPU 00000000:81:00.0
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 6
Pending : No
Remapping Failure Occurred : Yes
Bank Remap Availability Histogram
Max : 1274 bank(s)
High : 6 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
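The key field in this output is "Remapping Failure Occurred". The sketch below shows one way to check it automatically. It assumes the `key : value` layout printed above, and it assumes, in line with the DCGM message earlier in this thread recommending a field diagnostic, that a reported remapping failure is something to raise with the vendor rather than fix locally. `parse_row_remapper` and `needs_vendor_attention` are hypothetical helper names.

```python
def parse_row_remapper(text: str) -> dict[str, str]:
    """Collect the 'key : value' pairs from `nvidia-smi -q -d ROW_REMAPPER` output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip headers such as "Remapped Rows" that carry no value
            fields[key.strip()] = value.strip()
    return fields


def needs_vendor_attention(fields: dict[str, str]) -> bool:
    """Assumption: a reported remapping failure means the card should go to the vendor."""
    return fields.get("Remapping Failure Occurred") == "Yes"
```

The text can be captured with, for example, `nvidia-smi -q -d ROW_REMAPPER > remap.txt` and then passed to `parse_row_remapper`.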
Our vendor replaced the H100 by sending us a new one, and the issue was resolved. What I want to highlight is that the replacement was done extremely quickly. Thank you!!