pgcc compile error in OpenACC-CUDA interoperability example

Hello, I have written some code to demonstrate OpenACC-CUDA interoperability, but I am having problems compiling it. Could you please help me find the problem?

#include <stdio.h>
#include <cuda.h>
#include <curand_kernel.h>

static const int N=100;

void vecAdd (float restrict *a, float restrict *b, float restrict*sum)
{
	#pragma acc kernels loop present (a[N], b[N], sum[N]);
	for (int i=0; i<N; i++)
		sum[i]=a[i]+b[i];
}

__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
	int id = threadIdx.x + blockIdx.x * 64;
	if (id<N){
		//seed, sequence, offset, state
		curand_init ( seed, id, 0, &state[id] );
	}
}

__global__ void generate( curandState* globalState, float * array1, float * array2, float * array3 ) 
{
	int ind = threadIdx.x;
	if (ind <N){
		curandState localState = globalState[ind];
		float RANDOM = curand_uniform( &localState );
		array1[ind] = RANDOM;
		array2[ind] = (RANDOM+5)/123;
		array3[ind] = RANDOM+2;
		globalState[ind] = localState;
	}
}

int main()
{
      	curandState* devStates;
	float * a, * b, * sum, * vec;
	
	cudaMalloc ( &a, N*sizeof(float));
	cudaMalloc ( &b, N*sizeof(float));
	cudaMalloc ( &sum, N*sizeof(float));
	cudaMalloc ( &vec, N*sizeof(float));
	cudaMalloc ( &devStates, N*sizeof( curandState ));

	setup_kernel <<< N/256+1, 256 >>> (a, time(NULL));
	generate <<< N/256+1, 256 >>> ( devStates, a, b );

	#pragma acc declare device_resident (a[N], b[N], sum[N], vec[N])
	vecAdd (a,b,sum);

	cublasInit();
	cublasSaxpy(N, 2.0, sum, 2, vec, 1);
	cublasShutdown();
	
	#pragma acc host_data use_device (sum);
		
	for (int i=0; i<N; i++)
		printf(sum[i],"\n");
	return 0;
}

I compile it with:

pgcc -acc -I/opt/pgi/linux86-64/2012/cuda/4.2/include -Minfo=accel -L /opt/pgi/linux86-64/2012/cuda/4.2/lib64 -lcurand -ta=nvidia interop.c

And receive the following error:

PGC-F-0249-#error --  --- !!! UNKNOWN COMPILER: please provide a CUDA compatible definition for '__align__' !!! --- (/opt/pgi/linux86-64/2012/cuda/4.2/include/host_defines.h: 128)
PGC/x86-64 Linux 12.10-0: compilation aborted

Thanks a lot!

Hi i_alex2004,

Unfortunately, NVIDIA hasn’t updated their header files to allow pgcc to compile them. Also, pgcc doesn’t support CUDA C extensions. Our C++ compiler, pgcpp, does but only when targeting x86, not NVIDIA GPUs.

What this means is that you need to compile your CUDA C code with nvcc and your OpenACC code with PGI, and the two can’t be mixed in the same file. The resulting objects can be linked together, and device pointers are interoperable between them.
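As a rough sketch of that split (not tested with PGI 12.10), the OpenACC half could look like the C file below. The names `launch_scale`, `fill_and_scale` and `HAVE_CUDA_LAUNCHER` are made up for illustration. The launcher is assumed to be an extern "C" wrapper around a kernel launch, defined in a separate .cu file built with nvcc. A plain C stand-in is included only so the file builds on its own when OpenACC is disabled and the pragmas are ignored.

```c
/* acc_side.c: the OpenACC half, built with the PGI C compiler.
 * The CUDA half is assumed to live in a separate .cu file built with nvcc. */
#include <assert.h>
#include <stddef.h>

/* Hypothetical launcher. In a real build it would be defined in the .cu file
 * as an extern "C" wrapper around a __global__ kernel launch. */
void launch_scale(float *x, int n, float factor);

#ifndef HAVE_CUDA_LAUNCHER  /* hypothetical macro: define it when linking the .cu object */
/* Host stand-in so the sketch links without nvcc. It is only valid when
 * OpenACC is disabled, because it dereferences x on the host. */
void launch_scale(float *x, int n, float factor)
{
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}
#endif

/* Fill x with OpenACC, then hand the same buffer to the launcher. */
void fill_and_scale(float *x, int n, float factor)
{
    assert(x != NULL && n >= 0);
    #pragma acc data copyout(x[0:n])
    {
        #pragma acc parallel loop present(x[0:n])
        for (int i = 0; i < n; i++)
            x[i] = (float)i;

        /* Inside host_data, x names the device address of the buffer. */
        #pragma acc host_data use_device(x)
        {
            launch_scale(x, n, factor);
        }
    }
}
```

The two object files would then be linked together, with the CUDA runtime library added on the link line.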

Hope this helps,
Mat

Hi mkcolg,

Are there any updates on this feature?

I also want to do some interoperability work between CUDA and OpenACC, but since I am building a big package with CMake, I can’t specify the compiler (nvcc or pgcc/pgc++) per source file. Also, some files in the package have a .c or .cpp suffix but use CUDA features.

So, in a nutshell, I need the PGI compiler’s interoperability feature, but I need the compiler to recognize the CUDA code by itself rather than my having to specify it manually.

Do you know whether there is a plan for this or not?

I am using PGI Compiler 16.5 (trial) and NVCC 7.0/7.5.


Regards,
Millad

Hi Millad,

To intermix both CUDA 7.5 and OpenACC in the same source file, use nvcc as the command line compiler and pgc++ as the host compiler.

Note that CUDA 7.5 contains an error in the “/opt/cuda-7.5/include/host_config.h” header file where it restricts usage to just PGI 15.4. You will need to edit the file at line 87 to remove this check.
Change:

#if __PGIC__ != 15 || __PGIC_MINOR__ != 4 || !defined(__GNUC__) || !defined(__LP64__)
#error -- unsupported pgc++ configuration! Only pgc++ 15.4 on Linux x86_64 is supported!
#endif

to

#if !defined(__GNUC__) || !defined(__LP64__)
#error -- unsupported pgc++ configuration! Only pgc++ 15.4 on Linux x86_64 is supported!
#endif
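To see why PGI 16.5 trips the original guard but passes the edited one, the two preprocessor conditions can be mirrored as ordinary C functions. This is only an illustration of the logic quoted above; the function names are hypothetical, and the macro values are passed in as plain integers instead of being read from the compiler.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the original guard: returns 1 when the #error would fire. */
int strict_check_rejects(int pgic, int pgic_minor, int has_gnuc, int has_lp64)
{
    return pgic != 15 || pgic_minor != 4 || !has_gnuc || !has_lp64;
}

/* Mirrors the edited guard, which no longer looks at the PGI version. */
int relaxed_check_rejects(int has_gnuc, int has_lp64)
{
    return !has_gnuc || !has_lp64;
}

/* Prints how both guards treat a given pgc++ version, assuming that
 * __GNUC__ and __LP64__ are both defined (hypothetical helper). */
void report(int pgic, int pgic_minor)
{
    assert(pgic > 0 && pgic_minor >= 0);
    printf("pgc++ %d.%d: strict=%s relaxed=%s\n", pgic, pgic_minor,
           strict_check_rejects(pgic, pgic_minor, 1, 1) ? "error" : "ok",
           relaxed_check_rejects(1, 1) ? "error" : "ok");
}
```

Calling `report(16, 5)` shows the strict guard rejecting 16.5 while the relaxed one accepts it. The relaxed guard still requires `__GNUC__` and `__LP64__`.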

Here’s an example which I derived from the CUDA vectorAdd sample:

% cat vectorAdd.cpp
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    cudaError_t err = cudaSuccess;
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

#pragma acc data copyin(h_A[0:numElements],h_B[0:numElements]), copyout(h_C[0:numElements])
{
#pragma acc host_data use_device(h_A,h_B,h_C)
{
    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(h_A, h_B, h_C, numElements);
    err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
} // end host_data
} // end acc data region
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}

% nvcc -x cu -ccbin pgc++ -Xcompiler " -w -ta=tesla:cuda7.5 -Mcuda -V16.5 -Minfo=accel" vectorAdd.cpp
main:
      1, include "tmpxft_00008abb_00000000-4_vectorAdd.cudafe1.cpp"
           3, include "vectorAdd.cu"
               60, include "device_types.h"
                    69, include "builtin_types.h"
                         65, include "host_runtime.h"
                              72, include "stddef.h"
                                  214, include "driver_types.h"
                                      1445, include "surface_types.h"
                                            114, include "texture_types.h"
                                                 208, include "vector_types.h"
                                                       32, include "vectorAdd.cu"
                                                            34, Generating copyin(h_A[:numElements],h_B[:numElements])
                                                                Generating copyout(h_C[:numElements])
% a.out
[Vector addition of 50000 elements]
CUDA kernel launch with 196 blocks of 256 threads
Test PASSED
Done
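
The example above hands OpenACC-managed buffers to a CUDA kernel. For the opposite direction, as in the original question where the buffers come from cudaMalloc, OpenACC has the `deviceptr` clause, which declares that the pointers already hold device addresses. Below is a minimal sketch in C. The function name `vec_add_deviceptr` is hypothetical, and the caller is assumed to pass device pointers when the file is built with OpenACC enabled. Without OpenACC the pragma is ignored and the loop simply runs on the host with host pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Adds two vectors whose storage was allocated elsewhere (for example by
 * cudaMalloc in a separate .cu file). deviceptr tells OpenACC that a, b and
 * sum are already device addresses, so no data movement is generated. */
void vec_add_deviceptr(const float *a, const float *b, float *sum, int n)
{
    assert(a != NULL && b != NULL && sum != NULL && n >= 0);
    #pragma acc parallel loop deviceptr(a, b, sum)
    for (int i = 0; i < n; i++)
        sum[i] = a[i] + b[i];
}
```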

Hope this helps,
Mat

Thanks for the reply, Mat.

It worked. I was able to compile the file that I had a problem with.

However, when I try to compile the whole package by setting nvcc as the main compiler for C/C++ files, CMake gives me an error saying that CMAKE_C_COMPILER is set to a C++ compiler. I don’t think that a C version of nvcc exists, right?

Actually, I am trying to compile GROMACS with PGI. Its latest version relies on CMake as its build system.

I also have a problem passing the arguments that follow the -Xcompiler option when using CMake. Since they are in double quotes, I am not sure how to pass them through CMake’s -DCMAKE_C_FLAGS variable.

P.S.: Thanks for the hint on “host_config.h”. It helped. I hesitate to change header files like this, since I tend to assume that I am the one handling things wrong rather than a global header file like host_config.h.

I don’t think that a C version of nvcc exists, right?

Correct, nvcc is a C++ compiler. I’m not sure how to help here except to use pgcc for the C compiler and not use CUDA constructs in your C source.

Since it is in double quotation, I am worried about how to pass it to -DCMAKE_C_FLAGS variable of CMake.

While a bit more cumbersome, you can break up the line into multiple -Xcompiler flags:

nvcc -x cu -ccbin pgc++ -Xcompiler -ta=tesla:cuda7.5 -Xcompiler -Mcuda  -Xcompiler -Minfo=accel vectorAdd.cpp
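
If the flag list is long, the expansion can be generated instead of typed by hand. The C helper below is a hypothetical illustration and not part of nvcc or CMake. It assumes the flags are separated by spaces and contain no quoting, and it prefixes each one with its own -Xcompiler.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "-Xcompiler <flag>" for every space-separated flag in `flags`
 * into `out`. Returns 0 on success, -1 if `out` is too small.
 * Hypothetical helper for illustration only. */
int expand_xcompiler(const char *flags, char *out, size_t outsz)
{
    assert(flags != NULL && out != NULL && outsz > 0);
    size_t used = 0;
    const char *p = flags;
    out[0] = '\0';
    while (*p != '\0') {
        p += strspn(p, " ");              /* skip separating spaces */
        size_t len = strcspn(p, " ");     /* length of the next flag */
        if (len == 0)
            break;                        /* only trailing spaces were left */
        int n = snprintf(out + used, outsz - used, "%s-Xcompiler %.*s",
                         used ? " " : "", (int)len, p);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;                    /* output would be truncated */
        used += (size_t)n;
        p += len;
    }
    return 0;
}
```

Passing the flag string from the earlier quoted form, without the surrounding double quotes, produces one -Xcompiler per flag, in the same shape as the command above.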
Mat