Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

We encountered a problem when running the following script:


import torch

print('torch version:', torch.__version__)
is_avail = torch.cuda.is_available()
print('is_avail:', is_avail)
cnt = torch.cuda.device_count()
print('device cnt:', cnt)
curr_device = torch.cuda.current_device()
print('curr_device:', curr_device)
device = torch.device('cuda:0')
print(device)

aa = torch.randn(5)
# e.g. aa = tensor([-2.2084, -0.2700, 0.0921, -1.7678, 0.7642])
aa = aa.to(device)
print('Done')

Result:

torch version: 2.1.0+cu121
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
is_avail: False
device cnt: 1
Traceback (most recent call last):
  File "test_cuda1.py", line 11, in <module>
    curr_device = torch.cuda.current_device()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 769, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

Below is from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H800 Off | 00000000:17:00.0 Off | 0 |
| N/A 29C P0 73W / 700W | 18MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1052 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+

We have tried uninstalling and reinstalling CUDA, the driver, and PyTorch, with no luck. Please advise.
Attached is the nvidia-bug-report.log.gz.

The usual reasons for this are either an improper fabric manager install in an NVLink setup, or MIG mode being improperly enabled.
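
If you want a quick way to look at both of those, check the MIG mode reported per GPU and the Fabric section of the full query output. A minimal sketch, run on the base machine (the query field names assume a reasonably recent driver such as your 535.129.03, and the q() helper is just for illustration):

import subprocess

def q(cmd):
    # small helper for this sketch: run a command and return its text output
    p = subprocess.run(cmd, capture_output=True, text=True)
    return (p.stdout + p.stderr).strip()

# MIG mode per GPU ("Enabled" / "Disabled" / "[N/A]" on GPUs without MIG support)
print(q(["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"]))

# Fabric section of the full query (present on NVSwitch/NVLink-capable systems)
report = q(["nvidia-smi", "-q", "-i", "0"]).splitlines()
for i, line in enumerate(report):
    if line.strip().startswith("Fabric"):
        print("\n".join(l.rstrip() for l in report[i:i + 4]))
        break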

Rather than using torch to figure this out, validate your CUDA install using the methods in the CUDA Linux install guide.
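
For example, a torch-independent probe that exercises the same driver API entry points can be done from Python with ctypes. This is only a sketch, assuming libcuda.so.1 (installed by the driver) is visible to the loader:

import ctypes

# CUDA driver API library shipped with the NVIDIA driver (not the CUDA toolkit)
libcuda = ctypes.CDLL("libcuda.so.1")

# 0 = CUDA_SUCCESS; 802 = CUDA_ERROR_SYSTEM_NOT_READY ("system not yet initialized")
rc = libcuda.cuInit(0)
print("cuInit:", rc)

count = ctypes.c_int(0)
rc = libcuda.cuDeviceGetCount(ctypes.byref(count))
print("cuDeviceGetCount:", rc, "devices:", count.value)

If that also returns 802 outside the container, the problem is at the driver/fabric level rather than anything in PyTorch or the container image.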

Also, if it were my system, I wouldn't have X enabled on an H800.

Thanks, Robert. nvidia-smi shows that MIG is disabled. Also, what does "X enabled" on the H800 mean, and how do we disable it?

So my guess would be fabric manager, then. You haven't indicated much about the system this is running in, so it's just a guess.

See here.
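
On an NVSwitch-based machine the fabric manager has to be installed on the host (not in the container), running, and matched to the driver version. A minimal sketch of those checks, assuming a systemd host and the usual Debian/Ubuntu package naming (nvidia-fabricmanager-*):

import subprocess

def q(cmd):
    # small helper for this sketch
    p = subprocess.run(cmd, capture_output=True, text=True)
    return (p.stdout + p.stderr).strip()

# Should print "active"; "inactive"/"failed"/"unknown" would line up with error 802.
print("fabricmanager service:", q(["systemctl", "is-active", "nvidia-fabricmanager"]))

# The installed fabric manager package must match the driver version (535.129.03 here).
print("driver version:", q(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]))
print(q(["dpkg", "-l", "nvidia-fabricmanager*"]))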

Our server is Ubuntu 20.04.6 LTS. We are running PyTorch inside a Docker container on the server. The OS in the Docker container is Ubuntu 20.04.5 LTS. What other system info do you need? I also have nvidia-bug-report.log.gz, but I don't know how to upload it to the forum.

root@3fed4b1a61a3:/tao-pt/test# nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-19            0               N/A

root@3fed4b1a61a3:/tao-pt/test# nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : In Progress
Status : N/A
root@3fed4b1a61a3:/tao-pt/test#

Who is the manufacturer and what is the model number of the server? How many H800 GPUs are in the machine? Is it an HGX platform? What is the result of running nvidia-smi -a on the base machine (i.e. not in/from any container)?