Issue linking BLAS

Using PGI 19.4 on Ubuntu 18.04.

When I try to link with -lblas, I get the following error:

error while loading shared libraries: libpgatm.so: cannot open shared object file: No such file or directory

I can fix this by adding the PGI library directory to my path:

export LD_LIBRARY_PATH=/opt/pgi/linux86-64-llvm/19.4/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

But if I want to use GPU counters (PGI_ACC_TIME=1) with driver 418.67 (CUDA 10.1.168), I need to run my application with sudo.

Since sudo doesn't preserve LD_LIBRARY_PATH, this brings back the original error.
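I assume this is because sudo resets the environment (env_reset in sudoers), dropping LD_LIBRARY_PATH. One workaround sketch, assuming the install path above, is to re-inject the variable through env:

```shell
# Assumption: PGI 19.4 installed under /opt/pgi/linux86-64-llvm/19.4.
PGI_LIBS=/opt/pgi/linux86-64-llvm/19.4/lib

# sudo's env_reset drops LD_LIBRARY_PATH, so pass it explicitly to the child:
#   sudo env LD_LIBRARY_PATH="$PGI_LIBS" PGI_ACC_TIME=1 ./computeWorks_mm
# The same mechanism without sudo, showing the variable reaches the child:
env LD_LIBRARY_PATH="$PGI_LIBS" sh -c 'echo "$LD_LIBRARY_PATH"'
```

But I'd rather not depend on that on every run.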

Code is at https://github.com/mnicely/computeWorks_examples/tree/master/computeWorks_mm. You’ll need to change -lopenblas -> -lblas in the makefile.

Any ideas?

Hi Matt,

The problem is that “-Bstatic_pgi” only links the PGI runtime statically, not libblas. Since the shared BLAS library, “libblas.so”, was itself dynamically linked against PGI’s shared libraries, it brings in those references. Unfortunately, you can’t just use “-Bstatic”, since a few libraries, like libcuda.so, don’t have static versions. The solution is to add some link-time options so that the static version of libblas is used.
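As an aside, you can see the bracketing behavior with a tiny stand-alone demo (using gcc and GNU ld here purely for illustration; “libfoo” stands in for libblas):

```shell
# Build a static archive, libfoo.a, standing in for libblas.a:
cat > foo.c <<'EOF'
int foo(void) { return 42; }
EOF
cat > main.c <<'EOF'
int foo(void);
int main(void) { return foo() == 42 ? 0 : 1; }
EOF
gcc -c foo.c -o foo.o
ar rcs libfoo.a foo.o

# Everything between -Bstatic and -Bdynamic is resolved from static
# archives only; libraries linked after -Bdynamic (libc here) stay shared:
gcc main.c -L. -Wl,-Bstatic -lfoo -Wl,-Bdynamic -o demo
./demo && echo OK
```

The “-Xlinker” options below pass these same flags through nvcc to the underlying linker.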

It’s a bit tricky since you’re using nvcc, but I was able to get this to work by using the following options for “LIBS” in your makefile:

LIBS   := -lcublas -Xlinker "-Bstatic" -Xlinker "-lblas" -Xlinker "-Bdynamic"

% make
Building target: computeWorks_mm
nvcc -x cu -ccbin pgc++ -O2 -L/proj/pgi/linux86-64-llvm/19.4/lib/ -lcublas -Xlinker "-Bstatic" -Xlinker "-lblas" -Xlinker "-Bdynamic"  -Xcompiler "-V19.4 -Bstatic_pgi -acc -mp -ta=tesla:nordc -Mcuda -Minfo=accel" -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o "computeWorks_mm" "../src/computeWorks_mm.cu"
openACC(int, float, const float *, const float *, float, float *, const int &):
      1, include "computeWorks_mm.cu"
         163, Generating copyin(A[:n*n])
              Generating copyout(C[:n*n])
              Generating copyin(B[:n*n])
      1, include "computeWorks_mm.cu"
         167, Loop is parallelizable
      1, include "computeWorks_mm.cu"
         169, Loop is parallelizable
              Generating Tesla code
             167, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             169, #pragma acc loop gang /* blockIdx.y */
             172, #pragma acc loop seq
      1, include "computeWorks_mm.cu"
         172, Loop is parallelizable
std::chrono::duration<double, std::ratio<(long)1, (long)1000>>::duration<long, std::ratio<(long)1, (long)1000000000>, void>(const std::chrono::duration<T1, T2> &):
      1, include "computeWorks_mm.cu"
      1, include "computeWorks_mm.cu"
Finished building target: computeWorks_mm

% ldd computeWorks_mm
        linux-vdso.so.1 =>  (0x00007ffe48dcc000)
        libcublas.so.10 => /usr/lib64/libcublas.so.10 (0x00002b15af68c000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00002b15b3405000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002b15b360d000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002b15b3829000)
        libatomic.so.1 => /home/sw/thirdparty/gcc/gcc-8.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/8.2.0/../../../../lib64/libatomic.so.1 (0x00002b15b3a2e000)
        libstdc++.so.6 => /home/sw/thirdparty/gcc/gcc-8.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/8.2.0/../../../../lib64/libstdc++.so.6 (0x00002b15b3c35000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00002b15b3fc1000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00002b15b42c4000)
        /lib64/ld-linux-x86-64.so.2 (0x00005563a7073000)
        libgcc_s.so.1 => /home/sw/thirdparty/gcc/gcc-8.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/8.2.0/../../../../lib64/libgcc_s.so.1 (0x00002b15b4687000)
        libcublasLt.so.10 => /usr/lib64/libcublasLt.so.10 (0x00002b15b489e000)

Hope this helps,
Mat

Hey Mat,

Yes, I believe that fixed the issue!

I noticed you added the -Mcuda flag. Isn’t that just for Fortran?

Matt

No, not just for Fortran. It can also be used in C/C++ when linking OpenACC with CUDA code so that the CUDA libraries are brought into the link. We also change some of the setup code that’s linked in so the runtime will check whether the device data is coming from CUDA. Though since you’re using “nordc”, we’re not linking the device code, so adding “-Mcuda” probably doesn’t matter here. I just added it while trying to find the right combination of flags to get nvcc to statically link the BLAS library.