Compile and link OpenACC and CUDA in one program

Hi all,

I am trying to use a (self-written) CUDA function in a program with OpenACC.
Unfortunately, none of the online help I have found has been useful :-(

I wrote the following functions (a hello world test):

saxpy.cu:


#include <stdio.h>

extern const int N;
extern const int blocksize;

__global__
void hello(char *a, int *b)
{
a[threadIdx.x] += b[threadIdx.x];
}

__global__
void saxpy(const int size_vec, double *factor_1, double *factor_2, double *result)
{
int ii = blockIdx.x*blockDim.x + threadIdx.x;
if (ii < size_vec) result[ii] *= factor_1[ii] * factor_2[ii];
}

void test(int size_vec, int _factor_1_size, int _factor_2_size, int _result_size, double *_factor_1_data, double *_factor_2_data, double *_result_data)
{
double * d_factor_1_data;
double * d_factor_2_data;
double * d_result_data;

const int factor_1_size = _factor_1_size*sizeof(double);
const int factor_2_size = _factor_2_size*sizeof(double);
const int result_size = _result_size*sizeof(double);

cudaMalloc( (void**)&d_factor_1_data, factor_1_size);
cudaMalloc( (void**)&d_factor_2_data, factor_2_size);
cudaMalloc( (void**)&d_result_data, result_size);
cudaMemcpy( d_factor_1_data, _factor_1_data, factor_1_size, cudaMemcpyHostToDevice );
cudaMemcpy( d_factor_2_data, _factor_2_data, factor_2_size, cudaMemcpyHostToDevice );
cudaMemcpy( d_result_data, _result_data, result_size, cudaMemcpyHostToDevice );

int* _size_vec;
cudaMalloc((void**)&_size_vec, sizeof(int));
cudaMemcpy(_size_vec, &size_vec, sizeof(int), cudaMemcpyHostToDevice);

saxpy<<<(size_vec+255)/256, 256>>>(size_vec, d_factor_1_data, d_factor_2_data, d_result_data);

//cout << "a: " << _result_data[10] << endl;
cudaMemcpy( _result_data, d_result_data, result_size, cudaMemcpyDeviceToHost );
//cout << "b: " << _result_data[10] << endl;

cudaFree(d_factor_1_data);
cudaFree(d_factor_2_data);
cudaFree(d_result_data);
}


and:
main.cpp:


#include <stdio.h>

#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <sstream>
#include <cmath>
#include <algorithm>

#include "/home/rosenbs/src/carp-dcse-pt/branches/mechanics/PT_C/toolbox/vector/toolbox_vector.h"

void test(int size_vec, int _factor_1_size, int _factor_2_size, int _result_size, double *_factor_1_data, double *_factor_2_data, double *_result_data);


const int N = 16;
const int blocksize = 16;

using namespace std;

int main(){
int size_vec = 10000;
toolbox_vector result(size_vec);
toolbox_vector factor_1(size_vec);
toolbox_vector factor_2(size_vec);

for(int ii=0; ii<size_vec; ii++){
factor_1[ii] = ii+1;
factor_2[ii] = 2.0/(ii+1);
result[ii] = -123.0;
}

double * _factor_1_data = factor_1.data();
double * _factor_2_data = factor_2.data();
double * _result_data = result.data();

const int _factor_1_size = factor_1.size();
const int _factor_2_size = factor_2.size();
const int _result_size = result.size();

cout << "Test A: " << _result_data[10] << endl;

test(size_vec, _factor_1_size, _factor_2_size, _result_size, _factor_1_data, _factor_2_data, _result_data);

cout << "Test B: " << _result_data[10] << endl;
return 0;
}


Remark: A toolbox_vector is (more or less) a self-written std::vector.

To compile and link the program, I use the following commands:

–> nvcc -O3 --compile -x c++ -o main.o main.cpp
–> nvcc -c saxpy.cu
–> nvcc --cudart static -gencode arch=compute_52,code=compute_52 -gencode arch=compute_52,code=sm_52 -link -o "test" main.o saxpy.o

This works perfectly (but only with CUDA).


If I try the following:

–> pgc++ -O3 -c -x c++ -o main.o main.cpp
–> nvcc -c saxpy.cu
(That works, with a warning.)

But if I link with nvcc
–> nvcc --cudart static -gencode arch=compute_52,code=compute_52 -gencode arch=compute_52,code=sm_52 -link -o "test" main.o saxpy.o
I get the error "undefined reference to `__c_mcopy8'".

And if I link with pgc++
–> pgc++ arch=compute_52,code=compute_52 arch=compute_52,code=sm_52 -o "test" main.o saxpy.o
I get the error: undefined reference to `cudaMalloc'

OK, obvious, a missing library... But I can't find out which library I have to add (or which path to the libraries) so that pgc++ can link the CUDA functions.

I also tried:
–> pgc++ -O3 -fPIC -c -x c++ -o main.o main.cpp
–> nvcc -c saxpy.cu
–> pgc++ -fPIC -ta=nvidia:cuda8.0 -shared -o stats.so -o "test" main.o saxpy.o
which leads to the error:
/usr/bin/ld: saxpy.o: relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC

Can anyone tell me how I have to link my program so that a CUDA function is executed on the device with PGI?
Unfortunately, the link https://devblogs.nvidia.com/parallelforall/3-versatile-openacc-interoperability-techniques/ is not useful for this.
I would be really grateful :-/

Best,
Stefan

PS: I tried to use -Mcudax86, but this is not what I want. With this option PGI executes the CUDA function on the host, but I want it to run on the device.

Hi Stefan,

Assuming you have CUDA 7.5 or above, the simplest thing to do is to use the nvcc driver but tell it to use pgc++ as the host compiler via the “-ccbin” flag. By default, nvcc uses g++ as the host compiler. The “-Xcompiler” flag tells nvcc to pass these options to the host compiler.

Note that there’s an error in the CUDA 7.5 header files that only recognizes PGI version 15.4 as supported. You’ll need to remove this check from the header file. CUDA 8.0 checks against any PGI 16.x version, which is fine for now but will be a problem again next year. I’ll show the edits to make at the bottom of this post.

For example:

% nvcc -ccbin pgc++ -Xcompiler -w -Xcompiler -fast saxpy.cu main.cpp -o test1.out         
% ./test1.out
Test A: -123
Test B: -246

I also simplified your code so that the main program handles the data movement via OpenACC directives instead of doing it inside “test”.

% cat saxpy_nodata.cu
#include <stdio.h>

 extern const int N;
 extern const int blocksize;

 __global__
 void hello(char *a, int *b)
 {
 a[threadIdx.x] += b[threadIdx.x];
 }

 __global__
 void saxpy(const int size_vec, double *factor_1, double *factor_2, double *result)
 {
 int ii = blockIdx.x*blockDim.x + threadIdx.x;
 if (ii < size_vec) result[ii] *= factor_1[ii] * factor_2[ii];
 }

 void test(int size_vec, int _factor_1_size, int _factor_2_size, int _result_size, double *_factor_1_data, double *_factor_2_data, double *_result_data)
 {
   saxpy<<<(size_vec+255)/256, 256>>>(size_vec, _factor_1_data, _factor_2_data, _result_data);
 }

% cat main.cpp
 #include <stdio.h>
 #include <iostream>
 #include <fstream>
 #include <cstring>
 #include <string>
 #include <sstream>
 #include <cmath>
 #include <algorithm>

 void test(int size_vec, int _factor_1_size, int _factor_2_size, int _result_size, double *_factor_1_data, double *_factor_2_data, double *_result_data);

 using namespace std;

 int main(){
 int size_vec = 10000;
 double * result = new double[size_vec];
 double * factor_1 = new double[size_vec];
 double * factor_2 = new double[size_vec];

 for(int ii=0; ii<size_vec; ii++){
 factor_1[ii] = ii+1;
 factor_2[ii] = 2.0/(ii+1);
 result[ii] = -123.0;
 }

 cout << "Test A: " << result[10] << endl;

#pragma acc data copyin(factor_1[0:size_vec], factor_2[0:size_vec]), copy(result[0:size_vec])
{
  #pragma acc host_data use_device(factor_1, factor_2, result)
  {
    test(size_vec, size_vec, size_vec, size_vec, factor_1, factor_2, result);
  }
}

 cout << "Test B: " << result[10] << endl;
 return 0;
 }

% nvcc -ccbin pgc++ -Xcompiler -w -Xcompiler -fast -Xcompiler -acc -Xcompiler -ta=tesla:cc50 -Xcompiler -ta=tesla:cuda7.5 -Xcompiler -Minfo=accel  saxpy_nodata.cu main.cpp -o test1.out
main:
     29, Generating copyin(factor_1[:size_vec],factor_2[:size_vec])
         Generating copy(result[:size_vec])
% ./test1.out
Test A: -123
Test B: -246

As for separate compilation and linking with pgc++, add the “-Mcuda” flag to bring in the correct CUDA libraries.

% nvcc -c saxpy.cu
% pgc++ -O3 -Mcuda=7.5 main.cpp saxpy.o
main.cpp:
% ./a.out
Test A: -123
Test B: -246

Are you really trying to create a shared object? Without “-shared”, the link works fine. The relocation error is because you need to compile “saxpy.cu” with position-independent code. Unfortunately, I don’t know the nvcc option for this offhand but can find out if needed. Given that you have a main routine, I’m not sure this is really what you’re intending.
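Though if you do need the shared object, one thing worth trying is nvcc’s host-compiler pass-through, which should hand -fPIC down when compiling the .cu file. A sketch only (not verified here; the target name stats.so is just the one from your post):

```shell
% nvcc -Xcompiler -fPIC -c saxpy.cu
% pgc++ -fPIC -Mcuda=7.5 -shared -o stats.so saxpy.o
```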

Note that if you want to put OpenACC code inside a shared object, be sure to compile without RDC (-ta=tesla:nordc). RDC requires an extra link step that is not available with a .so.
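For instance, a .so build along those lines might look like this (a sketch with assumed file names; the flag spellings follow the examples above, with nordc added as a sub-option of -ta=tesla):

```shell
% pgc++ -fPIC -acc -ta=tesla:cc50,nordc -c acc_part.cpp
% pgc++ -shared -acc -ta=tesla:cc50,nordc -o libacc_part.so acc_part.o
```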

% nvcc -c saxpy.cu
% pgc++ -fPIC -O3 -c main.cpp
% nvcc -o test2.out saxpy.o main.o
% ./test2.out
Test A: -123
Test B: -246

Here’s the change you need to make to the CUDA 7.5 header to convince it to accept a PGI compiler other than 15.4.

Change around line 86 of “include/host_config.h” from:

#if defined(__PGIC__)

#if __PGIC__ != 15 || __PGIC_MINOR__ != 4 || !defined(__GNUC__) || !defined(__LP64__)

#error -- unsupported pgc++ configuration! Only pgc++ 15.4 on Linux x86_64 is supported!

#endif /* __PGIC__ != 15 || __PGIC_MINOR != 4 || !__GNUC__ || !__LP64__ */

to

#if defined(__PGIC__)

#if !defined(__GNUC__) || !defined(__LP64__)

#error -- unsupported pgc++ configuration! Only pgc++ 15.4 on Linux x86_64 is supported!

#endif /* __PGIC__ != 15 || __PGIC_MINOR != 4 || !__GNUC__ || !__LP64__ */

Hope this helps,
Mat

Hi Mat,

to be honest, that helps a LOT!
Indeed, you have foreseen my next step: to use OpenACC for the data movement and execute a corresponding CUDA kernel!

Thank you very much,
Stefan!