I get an error when I run the following sample code:
import torch
torch.zeros((2, 2)).to(torch.device("cuda"))
However, I have 4 GPUs installed, all with abundant free memory and no running processes:
Thu Jan 23 23:50:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:05:00.0 Off |                  N/A |
| 26%   63C    P0    76W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 39%   51C    P0    58W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 37%   49C    P0    60W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   39C    P0    57W / 250W |      0MiB / 11178MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Moreover, all GPUs have their compute mode set to Default, so it's not a permissions issue.
System Details
Ubuntu 16.04
NVIDIA-SMI 440.33.01
Driver Version: 440.33.01
CUDA Version: 10.2
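To complement these details with what PyTorch itself reports, here is a minimal diagnostic sketch. The function name cuda_report and the dictionary keys are hypothetical, chosen only for illustration. It retries the failing allocation on each device separately and keeps the full exception text, so the exact error message and the device that raises it are both visible.

```python
def cuda_report():
    """Collect what PyTorch itself sees; returns a plain dict."""
    report = {"torch_available": False}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; nothing more to check.
        return report

    report["torch_available"] = True
    report["torch_version"] = torch.__version__
    # CUDA toolkit version PyTorch was built against (None for CPU-only builds).
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_is_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()

    # Try the failing allocation on each device separately and keep the full
    # exception text instead of letting the first failure end the script.
    results = {}
    for i in range(report["device_count"]):
        try:
            torch.zeros((2, 2)).to(torch.device("cuda", i))
            results[i] = "ok"
        except Exception as exc:
            results[i] = f"{type(exc).__name__}: {exc}"
    report["per_device"] = results
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda differs from the CUDA version the driver supports, or if only one device index fails, that narrows down where to look next.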
I get the same error when I compile C code with nvcc and run something as simple as cudaMalloc.
My code does detect the GPUs. For example, the following code:
#include <stdio.h>
#include <cuda_runtime_api.h>
#include <cuda.h>

int main() {
    int nDevices;
    // Ask the CUDA runtime how many devices it can see.
    cudaGetDeviceCount(&nDevices);
    for (int i = 0; i < nDevices; i++) {
        // Query and print the basic properties of device i.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device Number: %d\n", i);
        printf(" Device name: %s\n", prop.name);
        printf(" Memory Clock Rate (KHz): %d\n",
               prop.memoryClockRate);
        printf(" Memory Bus Width (bits): %d\n",
               prop.memoryBusWidth);
        printf(" Peak Memory Bandwidth (GB/s): %f\n\n",
               2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    }
    return 0;
}
Prints out:
Device Number: 0
Device name: GeForce GTX 1080 Ti
Memory Clock Rate (KHz): 5505000
Memory Bus Width (bits): 352
Peak Memory Bandwidth (GB/s): 484.440000
Device Number: 1
Device name: GeForce GTX 1080 Ti
Memory Clock Rate (KHz): 5505000
Memory Bus Width (bits): 352
Peak Memory Bandwidth (GB/s): 484.440000
Device Number: 2
Device name: GeForce GTX 1080 Ti
Memory Clock Rate (KHz): 5505000
Memory Bus Width (bits): 352
Peak Memory Bandwidth (GB/s): 484.440000
Device Number: 3
Device name: GeForce GTX TITAN X
Memory Clock Rate (KHz): 3505000
Memory Bus Width (bits): 384
Peak Memory Bandwidth (GB/s): 336.480000
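As a sanity check that the reported properties are self-consistent, the peak bandwidth figures can be reproduced by hand from the clock rate and bus width printed above. This is a small sketch of the same arithmetic as the C expression; the function name peak_bandwidth_gbs is hypothetical.

```python
def peak_bandwidth_gbs(memory_clock_khz: int, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s, same expression as the C code."""
    # Factor 2.0: double data rate, i.e. two transfers per clock cycle.
    # bus_width_bits // 8 converts bits to bytes and mirrors the integer
    # division in the C expression.
    return 2.0 * memory_clock_khz * (bus_width_bits // 8) / 1.0e6


# Values taken from the output above.
print(f"{peak_bandwidth_gbs(5505000, 352):.2f}")  # GTX 1080 Ti: 484.44
print(f"{peak_bandwidth_gbs(3505000, 384):.2f}")  # GTX TITAN X: 336.48
```

Both values match the program output, so device enumeration and property queries work; only memory allocation fails.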
Interestingly enough, when I run
nvidia-smi -r
I get the error
GPU Reset couldn't run because GPU 00000000:05:00.0 is the primary GPU.
If I try to reset any of the remaining 3 GPUs, I do not get this error. I tried disabling this GPU and running the code with the remaining GPUs, with no luck. Could this be a hardware installation issue? I tried rebooting my machine, which did not help either.
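One detail worth checking when disabling a GPU: nvidia-smi lists the TITAN X as GPU 0 (bus 00000000:05:00.0), while the CUDA enumeration lists it as device 3. By default CUDA numbers devices fastest-first, whereas nvidia-smi uses PCI bus order, so index 0 means a different card in each tool. The sketch below assumes the GPU is hidden through the CUDA_VISIBLE_DEVICES environment variable (the post does not say how it was disabled) and sets CUDA_DEVICE_ORDER=PCI_BUS_ID so that the indices match nvidia-smi. The helper name env_hiding_gpu is hypothetical.

```python
import os


def env_hiding_gpu(hidden_index, total_gpus, base_env=None):
    """Return a copy of the environment that hides one GPU from CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA number devices by PCI bus ID, the order nvidia-smi uses,
    # instead of the default fastest-first order.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    visible = [str(i) for i in range(total_gpus) if i != hidden_index]
    env["CUDA_VISIBLE_DEVICES"] = ",".join(visible)
    return env


# Hide nvidia-smi's GPU 0 (bus 00000000:05:00.0) on a 4-GPU machine.
env = env_hiding_gpu(hidden_index=0, total_gpus=4, base_env={})
print(env["CUDA_VISIBLE_DEVICES"])  # prints 1,2,3
```

The resulting dictionary can be passed as env= to subprocess.run when launching the test script. Both variables must be set before the process initializes CUDA, since changing them afterwards has no effect.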