Hello Mat,
I have a similar use case: I need to build a shared library containing global variables marked with declare create, and I am seeing confusing behaviour. Below is a sample that mimics the usage in our application, built with PGI 19.10:
$ pgc++ --version
pgc++ 19.10-0 LLVM 64-bit target on x86-64 Linux -tp skylake
Sample :
$ cat test.cpp
#include <stdlib.h>
#include <stdio.h>
float* my_data;
#pragma acc declare create(my_data)
static double my_value = 1.0;
#pragma acc declare copyin(my_value)
double my_celsius;
#pragma acc declare create(my_celsius)
void init(double val) {
    my_data = (float*) malloc(sizeof(float)*3);
    my_data[0] = 1.1;
    #pragma acc enter data copyin(my_data[0:3])
    my_celsius = val;
    #pragma acc update device(my_celsius)
}
#pragma acc routine seq
double foo() {
    double y = 2.0;
    // 2 + 5 + 1 = 8
    return y + my_celsius + my_value;
}
void sample() {
    #pragma acc kernels present(my_celsius)
    {
        double val = foo();
        printf("\n VALUE :: %lf \n", val);
    }
}
and the driver program:
$ cat main.cpp
#include <iostream>
void init(double val);
void sample();
int main() {
    init(5.0);
    sample();
    std::cout << "Done\n";
    return 0;
}
I build this example in different ways:
$ cat build.sh
# Build 1
pgc++ -acc -Minfo test.cpp main.cpp
./a.out
# Build 2
pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
pgc++ -acc -ta=tesla:nordc main.cpp test.so
./a.out
# Build 3
pgc++ -acc -Minfo test.cpp -c -fPIC
pgc++ -acc -shared -o test.so test.o
pgc++ -acc main.cpp test.so
./a.out
and the output we get is:
$ export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH
$ export PGI_ACC_NOTIFY=2
$ bash -x build.sh
+ pgc++ -acc -Minfo test.cpp main.cpp
test.cpp:
init(double):
18, Generating enter data copyin(my_data[:3])
20, Generating update device(my_celsius)
foo():
23, Generating acc routine seq
Generating Tesla code
sample():
31, Accelerator serial kernel generated
Generating Tesla code
main.cpp:
+ ./a.out
upload CUDA data file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=_st__gpfs_bbp_cscs_ch_project_proj16_kumbhar_pramod_scratch_acc_lib_expr_test_cpp_my_value bytes=8
upload CUDA data file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=my_data bytes=12
upload CUDA data file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=20 device=0 threadid=1 variable=my_celsius bytes=8
VALUE :: 8.000000
Done
+ pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
init(double):
18, Generating enter data copyin(my_data[:3])
20, Generating update device(my_celsius)
foo():
23, Generating acc routine seq
Generating Tesla code
sample():
31, Accelerator serial kernel generated
Generating Tesla code
+ pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
+ pgc++ -acc -ta=tesla:nordc main.cpp test.so
main.cpp:
+ ./a.out
upload CUDA data file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=my_data bytes=12
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x615230 device:0x7fff9fafa000 size:12 presentcount:0+1 line:18 name:my_data
allocated block device:0x7fff9fafa000 size:512 thread:1
FATAL ERROR: data in update device clause was not found on device 1: name=my_celsius
file:/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp _Z4initd line:20
+ pgc++ -acc -Minfo test.cpp -c -fPIC
init(double):
18, Generating enter data copyin(my_data[:3])
20, Generating update device(my_celsius)
foo():
23, Generating acc routine seq
Generating Tesla code
sample():
31, Accelerator serial kernel generated
Generating Tesla code
+ pgc++ -acc -shared -o test.so test.o
+ pgc++ -acc main.cpp test.so
main.cpp:
+ ./a.out
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found
In summary, this is what I am seeing:
- Build 1: Everything works fine if we compile everything into a single executable without any library (or with a static library).
- Build 2: If we create the shared library with nordc, we get the "data in update device clause was not found" error. Why is that, and how can we get around it? If I comment out acc update device(my_celsius) in the init() function, I don’t see this error, but I don’t understand why this would work without updating the value on the device. Is it because the global variable gets copied on kernel launch? (I don’t see that with PGI_ACC_DEBUG, though.) (Edit: as mentioned in the next comment, this doesn’t work.)
- Build 3: After reading the previous comments, my impression was that this should work without nordc on newer PGI versions. Instead we get "call to cuModuleGetGlobal returned error 500: Not found". Does this mean we always have to use nordc with a shared library?
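To narrow down which globals actually reach the device in each build, a small presence check could be added to test.cpp. This is only a diagnostic sketch: acc_is_present comes from the OpenACC runtime API (openacc.h), while device_presence and report_presence are hypothetical helper names introduced here. When compiled without OpenACC, the helper returns -1 instead of querying the runtime.

```cpp
#include <cstddef>
#include <cstdio>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical helper (not part of the original sample).
// Returns 1 if the host range is in the device present table,
// 0 if it is not, and -1 when built without OpenACC support.
static int device_presence(void* host_ptr, std::size_t bytes) {
#ifdef _OPENACC
    return acc_is_present(host_ptr, bytes) ? 1 : 0;
#else
    (void)host_ptr;
    (void)bytes;
    return -1;
#endif
}

// Hypothetical helper: prints the presence status of one variable.
static void report_presence(const char* name, void* host_ptr, std::size_t bytes) {
    std::printf("%s: present=%d\n", name, device_presence(host_ptr, bytes));
}
```

For example, calling report_presence("my_celsius", &my_celsius, sizeof(my_celsius)) at the top of init(), before the update directive, would show whether the declare create entry was registered at load time in a given build. What it reports for Build 2 and Build 3 is not something I have verified.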
A shared library build is critical for our application, and it is difficult to get rid of the global variables because the legacy code is auto-generated from a DSL layer.
If you have any suggestions for making Build 2 / Build 3 work, it would be a great help!
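For reference, one direction that might sidestep the device-side globals is sketched below. It keeps the globals on the host and passes their values into the routine as arguments. The names foo_with_args and sample_with_args are hypothetical, and whether this avoids the errors in Build 2 / Build 3 is an assumption that has not been tested with either build.

```cpp
// Hypothetical variant of foo(): the values arrive as arguments, so the
// device code no longer references my_celsius or my_value directly.
#pragma acc routine seq
double foo_with_args(double celsius, double value) {
    double y = 2.0;
    return y + celsius + value;
}

// Hypothetical variant of sample(): the caller passes the host copies of
// the globals, and the result is copied back from the device explicitly.
double sample_with_args(double celsius, double value) {
    double result = 0.0;
    #pragma acc kernels copyout(result)
    {
        result = foo_with_args(celsius, value);
    }
    return result;
}
```

The host side would then call sample_with_args(my_celsius, my_value) after init(5.0). Since our code is generated from the DSL, this would mean changing the generator rather than the output, which is why a solution that keeps declare create would still be preferable.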