We encountered a problem when running the following script:
import torch
print('torch version:', torch.__version__)
is_avail = torch.cuda.is_available()
print('is_avail:', is_avail)
cnt = torch.cuda.device_count()
print('device cnt: ', cnt)
curr_device = torch.cuda.current_device()
print('curr_device:', curr_device)
device = torch.device('cuda:0')
print(device)
aa = torch.randn(5)
# e.g. aa = tensor([-2.2084, -0.2700, 0.0921, -1.7678, 0.7642])
aa = aa.to(device)  # .to() returns a new tensor; it does not move aa in place
print('Done')
Result:
torch version: 2.1.0+cu121
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
is_avail: False
device cnt: 1
Traceback (most recent call last):
  File "test_cuda1.py", line 11, in <module>
    curr_device = torch.cuda.current_device()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 769, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
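Because the script stops at the first failing call, the later checks never get to run. Below is a minimal sketch of a variant that records the outcome of every check instead of aborting. `run_probes` and `cuda_probes` are our own hypothetical helper names, not PyTorch APIs, and the sketch only reports errors; it does not attempt to fix the underlying initialization problem.

```python
def run_probes(probes):
    """Run each (label, callable) pair and record its value or its error text."""
    results = {}
    for label, fn in probes:
        try:
            results[label] = ("ok", fn())
        except Exception as exc:  # keep going so the later probes still report
            results[label] = ("error", f"{type(exc).__name__}: {exc}")
    return results


def cuda_probes():
    """Build the same checks as the script above (requires PyTorch)."""
    import torch

    return [
        ("torch version", lambda: torch.__version__),
        ("is_available", torch.cuda.is_available),
        ("device_count", torch.cuda.device_count),
        ("current_device", torch.cuda.current_device),
        ("tensor to cuda:0", lambda: torch.randn(5).to("cuda:0").device),
    ]


if __name__ == "__main__":
    try:
        probes = cuda_probes()
    except ImportError:
        probes = []
        print("PyTorch is not installed in this environment")
    for label, (status, value) in run_probes(probes).items():
        print(f"{label}: [{status}] {value}")
```

With this, a failure in `current_device` shows up as one `[error]` line and the remaining checks are still attempted.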
Below is the output of nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H800                    Off | 00000000:17:00.0 Off |                    0 |
| N/A   29C    P0              73W / 700W |     18MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1052      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
We have tried uninstalling and reinstalling CUDA, the driver, and PyTorch, to no avail. Please advise.
The nvidia-bug-report.log.gz is attached.