Calling cuBLAS from OpenACC parallel regions

Hello all,
I am trying to call cuBLAS library device routines from openacc parallel regions in C. My proof-of concept code is not compiling, and my trials has lead me to the following information and code :

  • I understand that I need to link against cublas_device and cudadevrt.
    Linking against cublas_device only works if I use the v2 cuBLAS API.
    The v2 API requires declaring a cuBLAS context handle (cublasHandle_t).
    However, using the cuBLAS context handle generates errors during compilation.

Here is my current code; the compilation output follows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <cuda_runtime.h>
#include <cublas_v2.h>

#pragma acc routine (cublasSaxpy) seq
#pragma acc routine (cublasCreate) seq
#pragma acc routine (cublasDestroy) seq

int main(int argc, char **argv) {

	float *x, *y, alpha = 2, *ptr_alpha;
	ptr_alpha = &alpha;
	int n = 1 << 20, i;

	x = (float*) malloc(n * sizeof(float));
	y = (float*) malloc(n * sizeof(float));

#pragma acc data create(x[0:n]) copyout(y[0:n]) copyin(ptr_alpha[0:1])
	{
#pragma acc kernels
#pragma acc loop independent
		for (i = 0; i < n; i++) {
			x[i] = 1.0f;
			y[i] = 0.0f;
		}

#pragma acc parallel num_gangs(1)
		{
			cublasHandle_t cnpHandle;
			int status;

			status = cublasCreate(&cnpHandle);

			if (CUBLAS_STATUS_SUCCESS == status) {
				/* Perform operation using cuBLAS */
				cublasSaxpy(cnpHandle, n, ptr_alpha, x, 1, y, 1);
			}

			cublasDestroy(cnpHandle);
		}
	}

	fprintf(stdout, "y[0] = %f\n", y[0]);
	return 0;
}

The output I am getting from PGI is:

PGCC-S-0107-Struct or union cublasContext not yet defined (callcublas3.c: 16)
PGCC-S-0155-Cannot determine bounds for array cnpHandle (callcublas3.c: 47)
PGCC-S-0155-Cannot determine bounds for array cnpHandle (callcublas3.c: 47)
PGCC-S-0155-Cannot determine bounds for array cnpHandle (callcublas3.c: 47)

The compilation command I am using is:

pgc++ -Minfo=all -Mcuda  -ta=tesla:cc35,cuda5.5 -I ~/installed/pgi/linux86-64/2014/cuda/5.5/include  -L ~/installed/pgi/linux86-64/2014/cuda/5.5/lib64 callcublas3-2.c -lcublas_device -lcudadevrt

I was able to eliminate the error about the cnpHandle bounds by defining the handle outside the ACC data region and copying it in with [0:1], but I couldn't fix the error about the cuBLAS context yet. It also seems strange to me that PGI is trying to copy cnpHandle, since it is declared inside the parallel region.
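Roughly, this is what that workaround attempt looked like (a sketch of the relevant part only, not the full program):

```c
/* Sketch of the workaround: the handle is declared on the host and copied
   in with [0:1] so the compiler can determine its bounds. cublasCreate()
   is still called inside the parallel region. This removes the bounds
   error but still hits the "cublasContext not yet defined" error. */
cublasHandle_t cnpHandle;

#pragma acc data create(x[0:n]) copyout(y[0:n]) copyin(ptr_alpha[0:1]) \
                copyin(cnpHandle[0:1])
{
#pragma acc parallel num_gangs(1)
	{
		if (cublasCreate(&cnpHandle) == CUBLAS_STATUS_SUCCESS) {
			/* Perform operation using cuBLAS */
			cublasSaxpy(cnpHandle, n, ptr_alpha, x, 1, y, 1);
			cublasDestroy(cnpHandle);
		}
	}
}
```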

Is it possible to get around this error?

I am thinking that compiling a wrapper device function via nvcc and calling it from ACC might work, but I am wondering whether it can be done plainly in that code without a wrapper.
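For reference, something like this is what I have in mind for the wrapper (the name saxpy_wrapper is hypothetical; the .cu file would be built with nvcc using relocatable device code, e.g. -arch=sm_35 -dc, and linked against cublas_device and cudadevrt):

```cuda
/* saxpy_wrapper.cu -- hypothetical CUDA wrapper around the cuBLAS
   device API, so that OpenACC only sees a plain device routine. */
#include <cublas_v2.h>

extern "C" __device__ void saxpy_wrapper(int n, float *alpha,
                                         float *x, float *y)
{
	cublasHandle_t h;
	if (cublasCreate(&h) == CUBLAS_STATUS_SUCCESS) {
		cublasSaxpy(h, n, alpha, x, 1, y, 1);
		cublasDestroy(h);
	}
}
```

On the C side this would be declared with `#pragma acc routine (saxpy_wrapper) seq` and called from the parallel region in place of the direct cuBLAS calls.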

Thank you.

Hi GPGPU is good,

Having the ability to call the cuBLAS device routines (as well as other CUDA C device routines) from OpenACC has been a goal for some time now, and it is one of the reasons the "routine" pragma was created. However, "routine" is very new and we only have basic support available. 14.7 will expand this support, but I think it will be a bit longer before we get to the point where your example works as is.

I added TPR#20600 to track your example and sent it on to engineering.


Support for calling cuBLAS device routines from OpenACC compute regions is now in PGI 15.7. See the examples under the 2015 directory, in the CUDA-Libraries sub-directory.
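For comparison, OpenACC codes can also drive the host cuBLAS library (rather than the device library) by handing device pointers to the host API with host_data. A minimal sketch, assuming the default host pointer mode so alpha can stay on the host:

```c
/* Sketch: SAXPY via the host cuBLAS v2 API from an OpenACC data region.
   host_data use_device makes x and y resolve to their device addresses
   for the duration of the enclosed host code. */
#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
	cublasHandle_t h;
	cublasCreate(&h);
#pragma acc host_data use_device(x, y)
	{
		cublasSaxpy(h, n, &alpha, x, 1, y, 1);
	}
	cublasDestroy(h);
}
```

This avoids the device-side handle entirely, at the cost of launching the SAXPY from the host rather than from inside a compute region.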