Not able to run AI workloads on H100 GPU

Hi there,
I have an H100 GPU with the CUDA 12.2 driver installed.
However, when I try to run AI workloads on it, the GPU is not detected.

I tried multiple libraries such as torch, tensorflow, and rapids, but had no luck.

I am attaching the errors I encounter with each library:-
Torch:-

Note: The latest version of torch most probably supports CUDA 12.1, but we have CUDA 12.2 drivers installed with Confidential Compute capability.

`torch.cuda.is_available()
/home/nvidia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False`
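
On the CUDA 12.1 vs 12.2 note above: NVIDIA drivers are backward compatible with older CUDA runtimes, so a wheel built against 12.1 would normally be expected to load on a driver that reports 12.2. That suggests the version gap alone is unlikely to explain Error 802. A minimal sketch of that comparison, assuming the driver version arrives as the integer that `cuDriverGetVersion` reports (1000 * major + 10 * minor). The helper names `decode_cuda_version` and `runtime_fits_driver` are made up for illustration, and the check is deliberately conservative:

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def runtime_fits_driver(runtime: str, driver_encoded: int) -> bool:
    """True when the runtime a wheel was built for is not newer than the driver's CUDA version."""
    major, minor = (int(part) for part in runtime.split(".")[:2])
    return (major, minor) <= decode_cuda_version(driver_encoded)


if __name__ == "__main__":
    # 12020 is how a driver that supports CUDA 12.2 encodes its version.
    print(decode_cuda_version(12020))
    # "12.1" is the kind of string torch.version.cuda holds for a CUDA 12.1 wheel.
    print(runtime_fits_driver("12.1", 12020))
```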

Tensorflow:-

`2024-01-25 09:03:06.641651: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-25 09:03:06.642114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-25 09:03:06.643253: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-25 09:03:06.648846: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-25 09:03:07.148088: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT`
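
None of the TensorFlow lines above reports a GPU failure as such: they are factory-registration messages, a CPU-instruction note, and a missing-TensorRT warning. To see what TensorFlow actually detects, here is a small sketch built on `tf.config.list_physical_devices`. The wrapper name `tensorflow_gpu_names` is hypothetical, and it returns `None` when TensorFlow is not installed:

```python
import importlib.util


def tensorflow_gpu_names():
    """Return the GPU device names TensorFlow can see, or None if TensorFlow is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily so the sketch also runs without TensorFlow

    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = tensorflow_gpu_names()
    if names is None:
        print("TensorFlow is not installed in this environment")
    elif not names:
        print("TensorFlow imported, but it sees no GPU")
    else:
        print("TensorFlow sees:", names)
```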

Rapids:-

`>>> import cudf
/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions:

stdout:



stderr:

Traceback (most recent call last):
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 267, in ensure_initialized
    self.cuInit(0)
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 340, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [802] Call to cuInit results in UNKNOWN_CUDA_ERROR

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 305, in __getattr__
    self.ensure_initialized()
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 271, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in UNKNOWN_CUDA_ERROR (802)


Not patching Numba
  warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/__init__.py", line 10, in <module>
    validate_setup()
  File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/gpu_utils.py", line 55, in validate_setup
    raise e
  File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/gpu_utils.py", line 52, in validate_setup
    gpus_count = getDeviceCount()
  File "/home/nvidia/.local/lib/python3.10/site-packages/rmm/_cuda/gpu.py", line 102, in getDeviceCount
    raise CUDARuntimeError(status)
rmm._cuda.gpu.CUDARuntimeError: cudaErrorSystemNotReady: system not yet initialized`
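
All three libraries stop at the same point: driver initialization returns status 802, which the last traceback names as `cudaErrorSystemNotReady`. That points below the framework level. A minimal sketch that calls `cuInit` through `ctypes` with no framework involved, assuming the driver library is installed as `libcuda.so.1` (the function name `probe_cuda_driver` is made up for this example):

```python
import ctypes


def probe_cuda_driver(library: str = "libcuda.so.1"):
    """Call cuInit(0) directly and return (status_code, status_name).

    Returns (None, reason) when the driver library cannot be loaded.
    """
    try:
        cuda = ctypes.CDLL(library)
    except OSError as exc:
        return None, f"could not load {library}: {exc}"

    status = cuda.cuInit(0)  # 0 means CUDA_SUCCESS

    # cuGetErrorName stores a pointer to a static string for a known status code.
    name = ctypes.c_char_p()
    if cuda.cuGetErrorName(status, ctypes.byref(name)) == 0 and name.value:
        return status, name.value.decode()
    return status, "unrecognized status"


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this also reports 802, reinstalling torch, TensorFlow, or RAPIDS is unlikely to change anything, since the failure happens before any of them is involved.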

@techgig were you able to resolve the issue?
I am using an H100 GPU system too.
nvidia-smi
Tue Jun 25 10:28:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:03:00.0 Off |                    0 |
| N/A   32C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:0C:00.0 Off |                    0 |
| N/A   29C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
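
The table above shows that the driver lists both H100s, with persistence mode off on each. `nvidia-smi` talks to the driver through NVML rather than the CUDA runtime, so it can list GPUs even while CUDA initialization fails. A small sketch that pulls the same fields in machine-readable form, which is easier to paste into a thread like this one. The helper names `parse_gpu_rows` and `query_gpus` are hypothetical, and the field list is just one reasonable selection:

```python
import shutil
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`.
FIELDS = ("index", "name", "driver_version", "persistence_mode")


def parse_gpu_rows(csv_text: str) -> list[dict]:
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return the parsed rows, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```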

python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch' has no attribute 'init'
>>> torch.cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 216, in init
    _lazy_init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
>>> torch.cuda.is_initialized()
False
>>> print(torch.cuda.is_available())
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.cuda.device_count())
1
>>> print(torch.cuda.get_device_name(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 365, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
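
The session above mixes calls that need a working CUDA context (`torch.cuda.init()`, `get_device_name`) with calls that only report state. A small sketch that gathers the same facts in one pass and captures the initialization error as text instead of a traceback. `torch_cuda_status` is a hypothetical helper, not part of PyTorch:

```python
def torch_cuda_status() -> dict:
    """Collect PyTorch's view of CUDA without letting an init failure abort the check."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    status = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "is_available": torch.cuda.is_available(),
        "init_error": None,
    }
    try:
        torch.cuda.init()
    except Exception as exc:  # RuntimeError in the session above; other types on CPU-only builds
        status["init_error"] = f"{type(exc).__name__}: {exc}"
    return status


if __name__ == "__main__":
    for key, value in torch_cuda_status().items():
        print(f"{key}: {value}")
```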

Hi, I ran into the same issue. Have you solved it yet?

You can follow the steps mentioned here:
https://docs.nvidia.com/cc-deployment-guide-tdx.pdf

Thank you for the reply. So the only supported driver version is nvidia-driver-550-server-open? I tried 550 before, but the issue stayed the same. Could you please share the correct steps for installing the driver?

How did you solve it? I have the same issue with an H100 NVL PCIe.

I’m having the same problem, and I followed the same guide. Any support here, NVIDIA?