Problem getting thrust device functionality to work

dang.khoi137 · October 17, 2020, 3:54pm

I am using the NVIDIA HPC SDK 20.9 and am trying to use Thrust on the device to accelerate sorting. I can’t get even the simplest code that uses Thrust’s device functionality to compile. Using nvc++ on the following code

#include <thrust/device_vector.h>

int main() {
  thrust::device_vector< double > v1(10);

  return 0;
}

results in the error below on compilation. Has anyone else run into this problem, or does anyone have ideas on how to get Thrust device code to compile properly? I truncated some of the error message, but about 15 instantiation errors are detected.

"/home/khoidang/.local/nvhpc/Linux_x86_64/20.9/cuda/include/thrust/system/detail/generic/for_each.h", line 66: error:
          incomplete type is not allowed
    THRUST_STATIC_ASSERT_MSG(
    ^
          detected during:
 .
 .
 .
            instantiation of "thrust::device_vector<T,
                      Alloc>::device_vector(thrust::device_vector<T,
                      Alloc>::size_type) [with T=double,
                      Alloc=thrust::device_allocator<double>]" at line 4 of
                      "test.cpp"

1 error detected in the compilation of "test.cpp".

mnicely · October 21, 2020, 3:32pm

What’s your compile command?

dang.khoi137 · October 21, 2020, 3:40pm

I’m using nvc++ test.cpp -o test.exe

mnicely · October 21, 2020, 4:18pm

You can either do

nvcc test.cu -o test.exe
or
nvcc -x cu test.cpp -o test.exe
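
A likely explanation, offered as an assumption rather than something confirmed in this thread: Thrust's default device backend is CUDA, so code that touches thrust::device_vector has to be compiled as CUDA source, which is what the .cu extension or -x cu tells nvcc to do. The sketch below illustrates the idea with a hypothetical helper, sort_doubles (not from the original post). It uses the __CUDACC__ macro, which nvcc defines when compiling CUDA source, to select the Thrust device path, and falls back to std::sort so the same file still builds with a plain host compiler.

```cpp
#include <algorithm>
#include <vector>

#ifdef __CUDACC__
// Only pull in Thrust's device headers when compiling as CUDA source.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#endif

// Hypothetical helper (not from the thread): sorts v in ascending order.
void sort_doubles(std::vector<double>& v) {
#ifdef __CUDACC__
    // Copy the host data to the GPU, sort it there, and copy the result back.
    thrust::device_vector<double> d(v.begin(), v.end());
    thrust::sort(d.begin(), d.end());
    thrust::copy(d.begin(), d.end(), v.begin());
#else
    // Host-only fallback used when the file is not compiled as CUDA.
    std::sort(v.begin(), v.end());
#endif
}
```

Calling sort_doubles on a std::vector<double> holding {3.5, -1.0, 2.0} leaves it as {-1.0, 2.0, 3.5} on either path; only the nvcc build runs the sort on the GPU.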

dang.khoi137 · October 21, 2020, 5:13pm

OK, thanks. That compiles the simple test case. Now I’m trying to use Thrust with OpenACC in a similar manner to this example (accelerator_interoperability/Hash at master · olcf/accelerator_interoperability · GitHub), where the Thrust code is inside a wrapper function in a separate file and the wrapper is called inside a #pragma acc parallel region. The only difference is that I am using a .cpp file instead of a .c file.

I can use nvcc -c sortGPU.cu and nvc++ -c sort.cpp to produce the object files, but I am unable to link sortGPU.o and sort.o. When I link with nvc++, I get a series of undefined reference errors:

tmpxft_00003171_00000000-6_gpu.cudafe1.cpp:(.text._ZN3cub11EmptyKernelIvEEvv[_ZN3cub11EmptyKernelIvEEvv]+0x54): undefined reference to `__cudaPopCallConfiguration'
tmpxft_00003171_00000000-6_gpu.cudafe1.cpp:(.text._ZN3cub11EmptyKernelIvEEvv[_ZN3cub11EmptyKernelIvEEvv]+0x99): undefined reference to `cudaLaunchKernel'
.
.
.
/home/khoidang/gpu/main.cpp:50: undefined reference to `__pgi_uacc_dataenterstart2'
/home/khoidang/gpu/main.cpp:63: undefined reference to `__pgi_uacc_dataoffb2'
.
.
.
/home/khoidang/gpu/main.cpp:63: undefined reference to `__pgi_uacc_dataexitdone'

Both the CUDA and the PGI errors occur when I link with nvc++, whereas only the PGI errors occur when I link with nvcc. Do you know the proper way to link this kind of example?

mnicely · October 21, 2020, 5:20pm

"wrapper function in a separate file"
You might need to add -rdc=true to generate relocatable device code.

Are you trying to call a thrust function inside a #pragma acc parallel region?
I’m not sure that’s possible in every case, because you’re asking each thread launched on the GPU to run that thrust function.

dang.khoi137 · October 21, 2020, 5:37pm

Sorry, I mistyped. I meant to write a #pragma acc host_data use_device(x,y) region. Even with -rdc=true, I still get the same undefined reference errors.
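
For reference, here is a minimal sketch of the wrapper side of this pattern, under stated assumptions. Inside a host_data use_device region the listed arrays refer to their device addresses, so a wrapper can wrap the raw pointer in a thrust::device_ptr and sort it in place. The function name sort_on_device and its signature are hypothetical, not taken from the linked example. The non-CUDA branch is only a host stand-in so the file builds without a CUDA toolchain; it does not accept device pointers.

```cpp
#include <algorithm>

#ifdef __CUDACC__
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#endif

// Hypothetical wrapper; it would live in a .cu file compiled by nvcc.
// extern "C" gives it an unmangled name that the OpenACC side can declare.
extern "C" void sort_on_device(double* data, int n) {
#ifdef __CUDACC__
    // data is assumed to be a device pointer, e.g. one obtained through
    // "#pragma acc host_data use_device(x)" on the caller's side.
    thrust::device_ptr<double> first = thrust::device_pointer_cast(data);
    thrust::sort(first, first + n);
#else
    // Host stand-in for builds without CUDA; here data must be a host pointer.
    std::sort(data, data + n);
#endif
}

// Caller side (separate .cpp file built with OpenACC enabled), as a comment:
//   extern "C" void sort_on_device(double* data, int n);
//   #pragma acc data copy(x[0:n])
//   {
//       #pragma acc host_data use_device(x)
//       { sort_on_device(x, n); }
//   }
```

As for the undefined references, one thing that may be worth checking (an assumption, not confirmed in this thread) is whether the nvc++ link line requests the OpenACC and CUDA runtimes, for example with -acc and -cuda.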