CUDA application only works on second GPU with sudo

Hi all,

I have a workstation with two GPUs installed (Ubuntu 16.04, CUDA 10, cuDNN 7.4.2).

I'm trying to run YOLOv3 (darknet) - it runs as expected on GPU 0, but fails on GPU 1 unless I run with sudo.

i.e.:

⟫ ./darknet -i 0 detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights data/dog.jpg
[expected output]

⟫ sudo ./darknet -i 1 detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights data/dog.jpg
[expected output]

⟫ ./darknet -i 1 detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights data/dog.jpg
CUDA Error: invalid device ordinal
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)

How can I run on GPU 1 without sudo?

⟫ nvidia-smi
Fri Feb  1 13:02:54 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 00000000:02:00.0  On |                  N/A |
| 46%   35C    P8    13W / 120W |    183MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
| 23%   40C    P8    23W / 235W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1557      G   /usr/lib/xorg/Xorg                           137MiB |
|    0      2373      G   compiz                                        42MiB |
+-----------------------------------------------------------------------------+

http://www.resultsovercoffee.com/2011/01/cuda-in-runlevel-3.html

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications

Hi, Nouveau appears not to be loaded (checked with: lsmod | grep nouveau), and I installed CUDA via dpkg rather than the .run installer.

The two GPU entries in /dev appear to have identical permissions - but they lack the setuid bit? Maybe that's the problem?

⟫ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Feb  1 13:20 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Feb  1 13:20 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Feb  1 13:20 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Feb  1 13:20 /dev/nvidia-modeset
crw-rw-rw- 1 root root 238,   0 Feb  1 13:20 /dev/nvidia-uvm

I tried downloading and running the script from section 4.4 (Device Node Verification), but it fails as follows:

⟫ sudo ./config_setuid_nvidia.sh
mknod: /dev/nvidia0: File exists
mknod: /dev/nvidia1: File exists
mknod: /dev/nvidiactl: File exists
mknod: /dev/nvidia-uvm: File exists
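
For what it's worth, the mknod errors just mean the device nodes already exist. Presumably a guarded variant of the guide's script, sketched here, would only create the missing ones:

#!/bin/bash
# Sketch of the install guide's 4.4 node-creation loop, with an existence
# check so mknod only runs for nodes the driver hasn't already created.
/sbin/modprobe nvidia || exit 1
NVDEVS=$(lspci | grep -i NVIDIA)
N3D=$(echo "$NVDEVS" | grep -c "3D controller")
NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")
N=$((N3D + NVGA - 1))
for i in $(seq 0 $N); do
  [ -e /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
done
[ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255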

Do these permissions look wrong? Any ideas on how I can move forward?

The permissions appear to be OK. 0666 device nodes are normal, and the setuid bit isn't relevant to device files (it only affects executables).

What is the output of deviceQuery on your system? I don't really need the whole output, just the CUDA enumeration order of the devices.

If I run as a regular user:

./samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40c"
  [................]
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

If I run with sudo…

./samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Quadro M4000"
[.........]

Device 1: "Tesla K40c"
[........]
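
That would explain it: as a regular user only one device is enumerated, and it shows up as device 0, so ordinal 1 doesn't exist. Your /dev/nvidia* permissions look fine, so the usual suspect is something in your user environment filtering the devices. Worth a quick check (hypothetical command):

⟫ env | grep -i cuda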

OK FIXED NOW

The problem was that I had put this in my .bashrc and forgotten about it:

export CUDA_VISIBLE_DEVICES=1

I did that to try to get GPU #1 used by default.
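
In hindsight this also explains why sudo worked: sudo resets the environment by default (env_reset in sudoers), so the variable never reached the sudo'd process, and it saw both GPUs. A quick way to see the difference (hypothetical session):

⟫ echo ${CUDA_VISIBLE_DEVICES:-unset}
1
⟫ sudo bash -c 'echo ${CUDA_VISIBLE_DEVICES:-unset}'
unset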

eye roll

If you want GPU 1 to be used by default, you could do:

export CUDA_VISIBLE_DEVICES="1,0"

which will reverse the enumeration order, but still make both available.
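
For example, given the default order in your sudo deviceQuery output, something like this sketch (output abridged) should now report the K40c as device 0:

⟫ CUDA_VISIBLE_DEVICES="1,0" ./deviceQuery | grep '^Device'
Device 0: "Tesla K40c"
Device 1: "Quadro M4000"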

Thanks, I've now done so; I think that was my original intention.