Hello all,
I am trying to call cuBLAS library device routines from OpenACC parallel regions in C. My proof-of-concept code does not compile, and my attempts so far have led me to the following information and code:
- I understand that I need to link against cublas_device and cudadevrt.
- Linking against cublas_device only works if I use the second version of the cuBLAS API (cublas_v2.h).
- The second API version requires declaring a cuBLAS context handle (cublasHandle_t).
- However, using the cuBLAS context handle generates errors during compilation.
Here is my current code; the compilation output follows it:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#pragma acc routine (cublasSaxpy) seq
#pragma acc routine (cublasCreate) seq
#pragma acc routine (cublasDestroy) seq
int main(int argc, char **argv) {
    float *x, *y, alpha = 2, *ptr_alpha;
    ptr_alpha = &alpha;
    int n = 1 << 20, i;
    x = (float*) malloc(n * sizeof(float));
    y = (float*) malloc(n * sizeof(float));
    #pragma acc data create(x[0:n]) copyout(y[0:n]) copyin(ptr_alpha[0:1])
    {
        #pragma acc kernels
        {
            #pragma acc loop independent
            for (i = 0; i < n; i++) {
                x[i] = 1.0f;
                y[i] = 0.0f;
            }
        }
        #pragma acc parallel num_gangs(1)
        {
            cublasHandle_t cnpHandle;
            int status;
            status = cublasCreate(&cnpHandle);
            if (CUBLAS_STATUS_SUCCESS == status) {
                /* Perform operation using cublas */
                cublasSaxpy(cnpHandle, n, ptr_alpha, x, 1, y, 1);
                cublasDestroy(cnpHandle);
            }
        }
    }
    fprintf(stdout, "y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}
The output I am getting from PGI is:
PGCC-S-0107-Struct or union cublasContext not yet defined (callcublas3.c: 16)
PGCC-S-0155-Cannot determine bounds for array cnpHandle (callcublas3.c: 47)
PGCC-S-0155-Cannot determine bounds for array cnpHandle (callcublas3.c: 47)
PGCC-S-0155-Cannot determine bounds for array cnpHandle (callcublas3.c: 47)
The compilation command I am using is:
pgc++ -Minfo=all -Mcuda -ta=tesla:cc35,cuda5.5 -I ~/installed/pgi/linux86-64/2014/cuda/5.5/include -L ~/installed/pgi/linux86-64/2014/cuda/5.5/lib64 callcublas3-2.c -lcublas_device -lcudadevrt
I was able to eliminate the errors about the cnpHandle bounds by defining the handle outside the ACC data region and copying it using [:1], but I have not been able to fix the error about the cuBLAS context yet. It also seems strange to me that PGI is trying to copy cnpHandle at all, since it is declared inside the parallel region.
Is it possible to get around this error?
I am thinking that compiling a wrapper device function with nvcc and calling it from OpenACC might work, but I am wondering whether it can be done directly in this code without a wrapper.
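To make the wrapper idea concrete, here is a rough, untested sketch of what I have in mind. The file name saxpy_wrapper.cu, the function name saxpy_wrapper, and the nvcc flags in the comment are my own assumptions. The idea is that the handle never appears in the OpenACC source, so PGCC would not need to know about cublasContext at all.

```cuda
// saxpy_wrapper.cu: hypothetical wrapper file, built separately with nvcc.
// Assumed build line (relocatable device code, compute capability 3.5):
//   nvcc -arch=sm_35 -rdc=true -c saxpy_wrapper.cu
#include <cublas_v2.h>

// extern "C" keeps the symbol unmangled so the C/OpenACC side can declare it.
// The int return value carries the cublasStatus_t code back to the caller.
extern "C" __device__ int saxpy_wrapper(int n, const float *alpha,
                                        const float *x, float *y)
{
    cublasHandle_t handle;  // the handle lives entirely in CUDA code
    cublasStatus_t status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS)
        return (int)status;

    // y = alpha * x + y with unit strides, same call as in my code above
    status = cublasSaxpy(handle, n, alpha, x, 1, y, 1);
    cublasDestroy(handle);
    return (int)status;
}

/* Assumed declaration on the OpenACC (C) side:
 *
 *   #pragma acc routine seq
 *   extern int saxpy_wrapper(int n, const float *alpha,
 *                            const float *x, float *y);
 *
 * saxpy_wrapper(n, ptr_alpha, x, y) would then replace the three cuBLAS
 * calls inside the "#pragma acc parallel num_gangs(1)" region.
 */
```

The resulting object file would presumably be added to the pgc++ link line together with -lcublas_device and -lcudadevrt, but I have not verified that the two toolchains link cleanly this way.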
Thank you.