Hi there,
I have an H100 GPU with the CUDA 12.2 driver installed on it.
But when I try to run AI workloads on it, the GPU is not detected.
I tried multiple libraries — torch, tensorflow, rapids — but no luck.
I am attaching the errors I encounter for each library:
Torch:
Note: the latest torch release most likely targets CUDA 12.1, but we have the CUDA 12.2 driver installed with Confidential Compute capability enabled.
`>>> torch.cuda.is_available()
/home/nvidia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False`
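To confirm the failure is in the driver itself rather than in torch, here is a minimal sketch (assuming `libcuda.so`, the driver API library, is on the loader path) that calls `cuInit(0)` directly via ctypes. Based on the numba traceback below, I would expect it to return the same 802 code on this machine:

```python
import ctypes
import ctypes.util

def probe_cuinit():
    """Call cuInit(0) through the CUDA driver API, bypassing all frameworks.

    Returns None if libcuda is not found, otherwise the CUresult code
    (0 == CUDA_SUCCESS; 802 is the "system not yet initialized" error
    seen in the logs in this post).
    """
    libname = ctypes.util.find_library("cuda")  # resolves libcuda.so
    if libname is None:
        return None
    cuda = ctypes.CDLL(libname)
    return cuda.cuInit(0)

if __name__ == "__main__":
    rc = probe_cuinit()
    if rc is None:
        print("libcuda not found on this machine")
    else:
        print(f"cuInit(0) returned {rc}")
```

If this also returns 802, the problem is below every framework, at driver initialization.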
TensorFlow:
`2024-01-25 09:03:06.641651: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-25 09:03:06.642114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-25 09:03:06.643253: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-25 09:03:06.648846: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-25 09:03:07.148088: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT`
RAPIDS:
`>>> import cudf
/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions:
stdout:
stderr:
Traceback (most recent call last):
File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 267, in ensure_initialized
self.cuInit(0)
File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 340, in safe_cuda_api_call
self._check_ctypes_error(fname, retcode)
File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_ctypes_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [802] Call to cuInit results in UNKNOWN_CUDA_ERROR
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 305, in __getattr__
self.ensure_initialized()
File "/home/nvidia/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 271, in ensure_initialized
raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in UNKNOWN_CUDA_ERROR (802)
Not patching Numba
warnings.warn(msg, UserWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/__init__.py", line 10, in <module>
validate_setup()
File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/gpu_utils.py", line 55, in validate_setup
raise e
File "/home/nvidia/.local/lib/python3.10/site-packages/cudf/utils/gpu_utils.py", line 52, in validate_setup
gpus_count = getDeviceCount()
File "/home/nvidia/.local/lib/python3.10/site-packages/rmm/_cuda/gpu.py", line 102, in getDeviceCount
raise CUDARuntimeError(status)
rmm._cuda.gpu.CUDARuntimeError: cudaErrorSystemNotReady: system not yet initialized`
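All three stacks appear to hit the same underlying error 802 (`cudaErrorSystemNotReady`: system not yet initialized). In case it helps, here is a small script I can run to gather driver-side state; note the fabric-manager service name is an assumption on my part for HGX/NVSwitch systems, and the script simply skips any command that is not installed:

```python
import shutil
import subprocess

# Diagnostics to collect before debugging further. Each command is run
# only if its binary exists; otherwise it is reported as "not installed".
COMMANDS = [
    ["nvidia-smi"],        # driver version and basic GPU visibility
    ["nvidia-smi", "-q"],  # full query output
    # Assumption: service name for the NVIDIA Fabric Manager on HGX systems.
    ["systemctl", "is-active", "nvidia-fabricmanager"],
]

def run_diagnostics(commands=COMMANDS):
    """Run each command and collect its output keyed by the command string."""
    results = {}
    for cmd in commands:
        name = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[name] = "not installed"
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.stdout.strip() or proc.stderr.strip()
    return results

if __name__ == "__main__":
    for name, out in run_diagnostics().items():
        print(f"$ {name}\n{out}\n")
```

Happy to post the output of any of these if it helps narrow things down.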