-ta=nvidia,host and libcuda.so requirement

Hi,

I was trying version 10.0 to create unified binaries that run with
or without an accelerator.
It seems, however, that libcuda.so is required in any case (even
with ACC_DEVICE=host) and must also be installed on platforms
without an accelerator.
Also, libcuda.so is not shipped with the compiler. Under openSUSE,
for instance, it's part of the video driver package, which would not
usually be installed on a machine without an accelerator.

Is this the intended behaviour? Or shouldn't the runtime system
try to avoid using any CUDA libraries when apparently no
accelerator is present or wanted?

Also, copying libcuda.so to some place listed in $LD_LIBRARY_PATH
does not help; I then get the error

call to cuInit returned error 100: No device
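
(In case it is useful for diagnosis: whether libcuda.so is a hard
link-time dependency of the unified binary or is loaded on demand by
the runtime can be checked with ldd; "mybinary" below is just a
placeholder for the compiled program:

  ldd ./mybinary | grep cuda

If nothing is listed, the runtime must be locating and loading the
library itself at run time.)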


Regards,
Norbert

Hi Norbert,

I just double-checked and did not have any problems when I created a unified binary on a system with an NVIDIA GPU and then ran it on another without a GPU. I was only able to reproduce your error when I compiled with just "-ta=nvidia". Can you please double-check that you compiled with "-ta=nvidia,host"?

Thanks,
Mat

Ok, one by one. I was actually using one of the official examples:

hostA% pgaccelinfo | grep 'Device Name'
Device Name: Tesla C1060
Device Name: Tesla C1060
hostA% pgfortran -o f2.uni f2.f90 -ta=nvidia,host -Minfo -fast
main:
      1, PGI Unified Binary version for -tp=nehalem-64 -ta=host
     20, Unrolled inner loop 8 times
     26, Generated an alternate loop for the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     32, Generated an alternate loop for the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     38, Loop not vectorized/parallelized: contains call
main:
      1, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
     20, Unrolled inner loop 8 times
     25, Generating copyin(a(1:n))
         Generating copyout(r(1:n))
     26, Loop is parallelizable
         Accelerator kernel generated
         26, !$acc do parallel, vector(256)
     32, Generated an alternate loop for the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     38, Loop not vectorized/parallelized: contains call
hostA% ./f2.uni
100000 iterations completed
1230 microseconds on GPU
1482 microseconds on host

Now hostB: same directory, same environment, but no CUDA installed (and no accelerator hardware):

hostB% pgaccelinfo | grep 'Device Name'
hostB% ./f2.uni
libcuda.so not found, exiting
hostB% ACC_DEVICE=host ./f2.uni
libcuda.so not found, exiting

At this point I noticed that I had not mentioned Fortran in my
original post, and Mat is probably using C.
So, the same test with a C example:


hostB% ./c2.uni
100000 iterations completed
1546 microseconds on GPU
1530 microseconds on host
hostB%

Aaaah, so it's probably a Fortran runtime problem.
Also interesting: hostB reports having spent some time on the non-existent GPU. But that's off-topic.

Norbert

Hi Norbert,

The example code you are using has the following line:

  call acc_init( acc_device_nvidia )

In other words, by using this runtime call, the code forces the use of the NVIDIA device. Changing the acc_init argument to "acc_device_default" will allow the unified binary to run on either target.
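
Concretely, the change would look something like this (a minimal sketch, assuming the example's use of the PGI accel_lib module; the program body here is illustrative, not the actual f2.f90):

  program f2_sketch
    use accel_lib                        ! PGI accelerator runtime module
    implicit none
    integer, parameter :: n = 100000
    real :: a(n), r(n)
    integer :: i

    ! Was: call acc_init( acc_device_nvidia )  <- forces the NVIDIA device
    call acc_init( acc_device_default )        ! let the runtime choose

    a = 1.0
  !$acc region
    do i = 1, n
       r(i) = a(i) * 2.0
    end do
  !$acc end region
    print *, r(1), r(n)
  end program f2_sketch

With acc_device_default, the runtime uses the NVIDIA device when one is present and falls back to the host code generated by -ta=nvidia,host otherwise.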

Note that the c2 C example has the same issue.
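
If you would rather make the decision explicit in the program, the runtime can also be queried first. A sketch under the same accel_lib assumption (I have not verified how acc_get_num_devices behaves when libcuda.so itself is missing):

  program choose_device
    use accel_lib
    implicit none
    integer :: ndev
    ! How many NVIDIA devices does the runtime see?
    ndev = acc_get_num_devices( acc_device_nvidia )
    if ( ndev > 0 ) then
       call acc_init( acc_device_nvidia )
    else
       call acc_init( acc_device_host )
    end if
    print *, 'NVIDIA devices found:', ndev
  end program choose_device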

Hope this helps,
Mat