Not able to run AI workloads on H100 GPU

Hi there,
I have an H100 GPU with the CUDA 12.2 driver installed. But when I try to run AI workloads on it, the GPU is not detected.

I have tried multiple libraries (PyTorch, TensorFlow, RAPIDS) with no luck.

I am attaching the errors I encounter with each library:

PyTorch:

Note: the latest version of PyTorch most likely targets CUDA 12.1, but we have the CUDA 12.2 driver installed with Confidential Compute capability.
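As a side note, a cu121 build on a 12.2 driver should not itself be the problem: under CUDA's minor-version compatibility rule, the driver only needs to report a CUDA version greater than or equal to the one the wheel was built against. A minimal sketch of that rule (the function name is just for illustration):

```python
# Sketch of the CUDA minor-version compatibility rule: a wheel built
# against toolkit X.Y runs as long as the installed driver reports a
# CUDA version >= X.Y within the same major version.
def driver_supports(driver_cuda: str, wheel_cuda: str) -> bool:
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(driver_cuda) >= as_tuple(wheel_cuda)

print(driver_supports("12.2", "12.1"))  # cu121 wheel on a 12.2 driver -> True
print(driver_supports("11.8", "12.1"))  # driver too old -> False
```

So the version mismatch alone would not explain the failure below.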

```
>>> torch.cuda.is_available()
/home/nvidia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
```
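For reference, error 802 decodes to `cudaErrorSystemNotReady`: the driver loaded, but some system-level GPU initialization has not completed. A small (hypothetical) lookup helper to keep the relevant init-time error codes straight:

```python
# Hypothetical helper mapping CUDA init-time error codes to their names.
# 802 is cudaErrorSystemNotReady: the driver is loaded, but system-level
# initialization (e.g. nvidia-fabricmanager on NVSwitch-based H100
# systems, or the Confidential Compute "GPU ready" state) is not done.
CUDA_INIT_ERRORS = {
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    999: "cudaErrorUnknown",
}

def decode(code: int) -> str:
    return CUDA_INIT_ERRORS.get(code, f"unrecognized code {code}")

print(decode(802))  # cudaErrorSystemNotReady
```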

TensorFlow:

```
2024-01-25 09:03:06.641651: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-25 09:03:06.642114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-25 09:03:06.643253: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-25 09:03:06.648846: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-25 09:03:07.148088: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
```

RAPIDS:

```
>>> import cudf
/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions:

stdout:



stderr:

Traceback (most recent call last):
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 267, in ensure_initialized
    self.cuInit(0)
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 340, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [802] Call to cuInit results in UNKNOWN_CUDA_ERROR

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 305, in __getattr__
    self.ensure_initialized()
  File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 271, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in UNKNOWN_CUDA_ERROR (802)


Not patching Numba
  warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/__init__.py", line 10, in <module>
    validate_setup()
  File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/gpu_utils.py", line 55, in validate_setup
    raise e
  File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/gpu_utils.py", line 52, in validate_setup
    gpus_count = getDeviceCount()
  File "/home/nvidia/.local/lib/python3.10/site-packages/rmm/_cuda/gpu.py", line 102, in getDeviceCount
    raise CUDARuntimeError(status)
rmm._cuda.gpu.CUDARuntimeError: cudaErrorSystemNotReady: system not yet initialized
```

@techgig were you able to resolve the issue? I am using an H100 GPU system too.
```
$ nvidia-smi
Tue Jun 25 10:28:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:03:00.0 Off |                    0 |
| N/A   32C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:0C:00.0 Off |                    0 |
| N/A   29C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

```
$ python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch' has no attribute 'init'
>>> torch.cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 216, in init
    _lazy_init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
>>> torch.cuda.is_initialized()
False
>>> print(torch.cuda.is_available())
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.cuda.device_count())
1
>>> print(torch.cuda.get_device_name(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 365, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
```
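For anyone else hitting this: error 802 (`cudaErrorSystemNotReady`) generally means the driver loaded but system-level GPU initialization has not completed, rather than anything framework-specific. A sketch of the checks worth running (service names assume a systemd-based NVIDIA data-center driver install; the `conf-compute` subcommand applies only when Confidential Computing is enabled):

```shell
# System services an H100 typically needs before CUDA can initialize.
systemctl status nvidia-persistenced      # persistence daemon
systemctl status nvidia-fabricmanager     # required on NVSwitch/HGX H100 systems

# With Confidential Computing enabled, the GPU must also be put into the
# "ready" state after each boot before CUDA applications can attach.
nvidia-smi conf-compute -grs              # get current GPU ready state
sudo nvidia-smi conf-compute -srs 1       # set ready state to ready
```

If `nvidia-fabricmanager` is inactive or the ready state is off, CUDA init fails with exactly this 802 error even though `nvidia-smi` itself looks healthy.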