need to run pgaccelinfo as root once

[ CentOS 5.3 machine, compilers 9.0-1 and 9.0-4, 2 Tesla C1060s. ]

If root doesn’t run pgaccelinfo once after a reboot, users get “No accelerators found”.

Running pgaccelinfo as root then magically makes everything right, and users can see and use the GPUs.

I’ve temporarily added a pgaccelinfo call to an rc file, but why is this one-off step needed?

thanks

Hi tonycurtis,

We have the same issue here with our systems, and my understanding is that it’s a general problem with NVIDIA. What our IT guys have told me is that the “/dev/nvidiaN” device files need to be created before a user can run a job on the NVIDIA devices. These device nodes either need to be created manually or are created automatically when the first job is run. However, since only root has permission to create them, root must be the first to run a job.

What we do is create a “/etc/init.d/nvidia” file with the following contents (and also add “/etc/init.d/nvidia” to boot.local):

/etc/init.d% cat nvidia
#!/bin/sh

# Create the NVIDIA character device nodes: major number 195,
# minor N for each GPU, and minor 255 for the control device.
mknod /dev/nvidia0 c 195 0
mknod /dev/nvidia1 c 195 1
mknod /dev/nvidiactl c 195 255

# Make the devices accessible to all users.
chown root:video /dev/nvidia*
chmod 0666 /dev/nvidia*

# Run pgaccelinfo once to load the NVIDIA driver and spin down the fans.
/usr/pgi/linux86-64/9.0-1/bin/pgaccelinfo

You would need to adjust this for the number of devices (this system has two) and change the group when setting ownership if yours differs. Note that the pgaccelinfo call is left in because, until the NVIDIA driver is loaded, the fans on our devices run at full speed.
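If you’d rather not hard-code the device count, something along these lines might work. This is only a sketch: it assumes lspci is available and that the grep pattern matches your cards, so check it against your own lspci output first.

#!/bin/sh
# Sketch only: create one /dev/nvidiaN node per NVIDIA controller found
# on the PCI bus instead of hard-coding the count. The grep pattern is
# an assumption; verify it against your own lspci output.
NGPUS=$(lspci | grep -i -c 'controller: nvidia')
i=0
while [ "$i" -lt "$NGPUS" ]; do
    [ -e "/dev/nvidia$i" ] || mknod "/dev/nvidia$i" c 195 "$i"
    i=$((i + 1))
done
[ -e /dev/nvidiactl ] || mknod /dev/nvidiactl c 195 255
chown root:video /dev/nvidia*
chmod 0666 /dev/nvidia*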

Hope this helps,
Mat

the “/dev/nvidiaN” device files need to be created before a user can run a job on the NVIDIA devices

actually, I knew that but had had a brain malfunction :-)

Running pgaccelinfo enumerates the cards (two here) and creates all the appropriate device nodes.
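A quick way to confirm the nodes came up correctly after a reboot (the listing below is illustrative; owner, group, and dates will vary with your setup):

ls -l /dev/nvidia*
# crw-rw-rw- 1 root video 195,   0 ... /dev/nvidia0
# crw-rw-rw- 1 root video 195,   1 ... /dev/nvidia1
# crw-rw-rw- 1 root video 195, 255 ... /dev/nvidiactl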

I have a similar issue, “No accelerators found”, on a cluster where cuda/10.0 works fine but loading cuda/10.1 produces the error. Is there an environment variable that needs to be defined, or a CUDA-specific setting that is missing from cuda/10.1 onwards?

Configuration: PGI 19.10 Community Edition

Hi Amit_Ruhela,

This is most likely unrelated to the original post. Here, I’m guessing that the pgaccelinfo utility can’t find “libcuda.so” (the CUDA driver runtime library).

By “loading cuda/10.1”, I’m assuming you’re loading a new module? Does that change just the CUDA toolkit being used, or also the CUDA driver? I would expect just the toolkit, since the driver is typically installed under the system /usr/lib64 directory, but you may need to ask your system admin to be sure. It may be as simple as adding the location of the CUDA driver library to LD_LIBRARY_PATH, but I’m not certain.
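If it helps, here’s a rough way to check; the paths below are assumptions and will differ on your cluster:

# Sketch: see whether the dynamic loader can find the CUDA driver library.
ldconfig -p | grep libcuda              # is libcuda.so in the loader cache?
ls /usr/lib64/libcuda.so* 2>/dev/null   # a common system location
# If it lives somewhere non-standard, prepend that directory:
export LD_LIBRARY_PATH=/path/to/driver/lib:$LD_LIBRARY_PATH
pgaccelinfo                             # should now report the devices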

-Mat