Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

We encountered a problem when running the following script:


import torch

print('torch version:', torch.__version__)
is_avail = torch.cuda.is_available()
print('is_avail:', is_avail)
cnt = torch.cuda.device_count()
print('device cnt:', cnt)
curr_device = torch.cuda.current_device()
print('curr_device:', curr_device)
device = torch.device('cuda:0')
print(device)

aa = torch.randn(5)
# e.g. aa = tensor([-2.2084, -0.2700, 0.0921, -1.7678, 0.7642])
aa = aa.to(device)
print('Done')

Result:

torch version: 2.1.0+cu121
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
is_avail: False
device cnt: 1
Traceback (most recent call last):
  File "test_cuda1.py", line 11, in <module>
    curr_device = torch.cuda.current_device()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 769, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

Below is from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H800 Off | 00000000:17:00.0 Off | 0 |
| N/A 29C P0 73W / 700W | 18MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1052 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+

We have tried uninstalling and reinstalling CUDA, the driver, and PyTorch, with no luck. Please advise.
Attached is the nvidia-bug-report.log.gz.

The usual reasons for this are either an improper fabric manager install in an NVLink setup, or MIG mode being improperly enabled.
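
If you want a quick way to look at both of those, check the MIG mode reported per GPU and the Fabric section of the full query output. A minimal sketch, run on the base machine (the query field names assume a reasonably recent driver such as your 535.129.03, and the q() helper is just for illustration):

import subprocess

def q(cmd):
    # small helper for this sketch: run a command and return its text output
    p = subprocess.run(cmd, capture_output=True, text=True)
    return (p.stdout + p.stderr).strip()

# MIG mode per GPU ("Enabled" / "Disabled" / "[N/A]" on GPUs without MIG support)
print(q(["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"]))

# Fabric section of the full query (present on NVSwitch/NVLink-capable systems)
report = q(["nvidia-smi", "-q", "-i", "0"]).splitlines()
for i, line in enumerate(report):
    if line.strip().startswith("Fabric"):
        print("\n".join(l.rstrip() for l in report[i:i + 4]))
        break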

Rather than using torch to figure this out, validate your CUDA install using the methods in the CUDA Linux install guide.
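
For example, a torch-independent probe that exercises the same driver API entry points can be done from Python with ctypes. This is only a sketch, assuming libcuda.so.1 (installed by the driver) is visible to the loader:

import ctypes

# CUDA driver API library shipped with the NVIDIA driver (not the CUDA toolkit)
libcuda = ctypes.CDLL("libcuda.so.1")

# 0 = CUDA_SUCCESS; 802 = CUDA_ERROR_SYSTEM_NOT_READY ("system not yet initialized")
rc = libcuda.cuInit(0)
print("cuInit:", rc)

count = ctypes.c_int(0)
rc = libcuda.cuDeviceGetCount(ctypes.byref(count))
print("cuDeviceGetCount:", rc, "devices:", count.value)

If that also returns 802 outside the container, the problem is at the driver/fabric level rather than anything in PyTorch or the container image.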

Also, if it were my system, I wouldn't have X enabled on an H800.

Thanks, Robert. nvidia-smi shows that MIG is disabled. Also, what does "X enabled" on the H800 mean, and how do we disable it?

So my guess would be fabric manager, then. You haven't indicated much about the system this is running in, so it's just a guess.

See here.
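
On an NVSwitch-based machine the fabric manager has to be installed on the host (not in the container), running, and matched to the driver version. A minimal sketch of those checks, assuming a systemd host and the usual Debian/Ubuntu package naming (nvidia-fabricmanager-*):

import subprocess

def q(cmd):
    # small helper for this sketch
    p = subprocess.run(cmd, capture_output=True, text=True)
    return (p.stdout + p.stderr).strip()

# Should print "active"; "inactive"/"failed"/"unknown" would line up with error 802.
print("fabricmanager service:", q(["systemctl", "is-active", "nvidia-fabricmanager"]))

# The installed fabric manager package must match the driver version (535.129.03 here).
print("driver version:", q(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]))
print(q(["dpkg", "-l", "nvidia-fabricmanager*"]))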

Our server is Ubuntu 20.04.6 LTS. We are running PyTorch inside a Docker container on the server. The OS in the Docker container is Ubuntu 20.04.5 LTS. What other system info do you need? I also have nvidia-bug-report.log.gz, but I don't know how to upload it to the forum.

root@3fed4b1a61a3:/tao-pt/test# nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-19            0               N/A

root@3fed4b1a61a3:/tao-pt/test# nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : In Progress
Status : N/A
root@3fed4b1a61a3:/tao-pt/test#

Who is the manufacturer and what is the model number of the server? How many H800 GPUs are in the machine? Is it an HGX platform? What is the result of running nvidia-smi -a on the base machine (i.e. not in/from any container)?