Hello Mat, I would like to compare the reduction speed of OpenACC with the sum speed of cuBLAS, but the following error occurred. I believe I am misusing the cublasScasum function, but I have checked the relevant documentation and cannot tell where the error lies. Could you point out the problem?
int n; /* size of the vector */
int i;
timestruct t1, t2, t3;
long long t_acc, t_blas;
if( argc > 1 )
    n = atoi( argv[1] );
else
    n = 100000;
if( n <= 0 ) n = 100000;
float a_acc={0} ;
float a_blas={0};
cuComplex* e_count = (cuComplex*)malloc(n*sizeof(cuComplex));
const int incx=1;
for( i = 0; i < n; ++i )
{
    e_count[i].x = 1.0;
    e_count[i].y = 1.0;
}
#pragma acc data copyin(e_count[0:n],n,a_acc,a_blas)
{
    gettime( &t1 );
    #pragma acc parallel reduction(+:a_acc)
    for( i = 0; i < n; ++i ){
        a_acc=a_acc+e_count[i].x+e_count[i].y;
    }
    gettime( &t2 );
    #pragma acc host_data use_device(e_count,a_blas)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasScasum(handle,n,e_count,incx,&a_blas);
    }
    gettime( &t3 );
}
t_acc = usec(t1,t2);
t_blas = usec(t2,t3);
// if(a_acc!=a_blas) printf( "Test FAILED\n");
/* check the results */
printf( "%13d iterations completed\n", n );
printf( "%13ld microseconds on OpenACC\n",t_acc );
printf( "%13ld microseconds on Cublas\n", t_blas);
What’s the error you’re seeing? Which cuBLAS header file are you using?
There are two different header files for cuBLAS, and your call uses the “cublas_v2.h” interface (the one that takes a handle). If you’re including “cublas.h” instead, you’ll get an error saying that there are too many arguments.
Your code will also segfault because, by default, cuBLAS writes the result through the pointer on the host, and with “a_blas” in the host_data directive that pointer is a device address. I’d remove a_blas from the host_data directive and pass in the host pointer.
Since you didn’t include your header files, I wasn’t sure which timer you’re using, so I took the timing out. Here’s what seems to work for me:
% cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <cublas_v2.h>
int main(int argc, char * argv[]) {
    int n; /* size of the vector */
    int i;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;
    float a_acc={0} ;
    float a_blas={0};
    cuComplex* e_count = (cuComplex*)malloc(n*sizeof(cuComplex));
    const int incx=1;
    for( i = 0; i < n; ++i )
    {
        e_count[i].x = 1.0;
        e_count[i].y = 1.0;
    }
    #pragma acc data copyin(e_count[0:n],n)
    {
        #pragma acc parallel reduction(+:a_acc)
        for( i = 0; i < n; ++i ){
            a_acc=a_acc+e_count[i].x+e_count[i].y;
        }
        #pragma acc host_data use_device(e_count)
        {
            cublasHandle_t handle;
            cublasCreate(&handle);
            cublasScasum(handle,n,e_count,incx,&a_blas);
        }
    }
    /* check the results */
    if(a_acc!=a_blas) printf( "Test FAILED\n");
    printf( "%13d iterations completed\n", n );
}
% nvc -cuda -cudalib=cublas -acc test.c -w -fast ; a.out
100000 iterations completed
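One caveat on the check above: the exact comparison of a_acc and a_blas passes here because every element is positive and the sums are exactly representable in a float. cublasScasum sums the absolute values of the real and imaginary parts, while the OpenACC loop sums the signed parts, so the two can differ for other data. Below is a minimal host-side sketch for checking results. The names complex_f, ref_scasum, and nearly_equal are hypothetical helpers of mine (complex_f only stands in for cuComplex so the sketch builds without the CUDA headers), not part of cuBLAS.

```c
#include <stddef.h>

/* Hypothetical stand-in for cuComplex so this builds without CUDA headers. */
typedef struct { float x; float y; } complex_f;

/* Absolute value without pulling in libm. */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* Host reference for the quantity cublasScasum returns:
   the sum over all elements of |real| + |imag|. */
float ref_scasum(const complex_f *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += absf(v[i].x) + absf(v[i].y);
    return sum;
}

/* Relative comparison: returns 1 when a and b agree within rel_tol,
   scaled by the larger magnitude (with a floor of 1.0). */
int nearly_equal(float a, float b, float rel_tol)
{
    float scale = absf(a) > absf(b) ? absf(a) : absf(b);
    if (scale < 1.0f) scale = 1.0f;
    return absf(a - b) <= rel_tol * scale;
}
```

Something like nearly_equal(a_acc, a_blas, 1e-5f) is probably a safer test than != once the data is no longer all ones, since the GPU reductions may add the terms in a different order.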
Your format is incorrect given that these variables are floats, not long integers. Using “f” instead of “ld” will fix the “a_acc” issue.
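To make the format point concrete, here is a small sketch of conversions that match the types used in this thread: “%f” for the float sums, and “%lld” for the long long timings (the original code prints those with “%ld”, which is also a mismatch, even if it happens to work where long and long long have the same size). The helper names format_sum and format_usec are made up for illustration.

```c
#include <stdio.h>

/* Format a float result. A float argument is promoted to double
   when passed to printf-style functions, and %f expects a double. */
int format_sum(char *buf, size_t len, float sum)
{
    return snprintf(buf, len, "%13.1f sum", sum);
}

/* Format a long long timing. %lld is the conversion for long long. */
int format_usec(char *buf, size_t len, long long usec)
{
    return snprintf(buf, len, "%13lld microseconds", usec);
}
```

In the program itself this amounts to writing printf("%13.1f\n", a_acc) for the sums and printf("%13lld microseconds on OpenACC\n", t_acc) for the timings.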
Your “a_blas” will be incorrect since you’re overwriting the host value with the device value, which is uninitialized. You should remove “a_blas” from the copy clause.
To illustrate:
This creates an uninitialized “a_blas” on the device:
#pragma acc data copyin(e_count[0:n],n) copyout(a_acc,a_blas)
Here the host copy of “a_blas” is returned from the cuBLAS routine:
cublasScasum(handle,n,e_count,incx,&a_blas);
At the end of the data region, the device copy of a_blas is copied back to the host, overwriting the result from cuBLAS.
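A minimal sketch of the pattern that avoids this: only the arrays go in the data clause, and the result scalar is left out so the reduction hands its value back to the host. As I understand the OpenACC rules, a scalar named in a reduction clause on a parallel construct, and not otherwise placed in a data clause, is copied back at the end of the compute region, but treat that as an assumption to verify with your compiler. The function name sum_parts and the split re/im arrays are mine, chosen so the sketch builds without the CUDA headers.

```c
/* Sum the real and imaginary parts with an OpenACC reduction.
   When built without OpenACC support the pragmas are ignored
   and the loop simply runs on the host. */
float sum_parts(const float *re, const float *im, int n)
{
    float total = 0.0f;   /* result scalar: deliberately NOT in a data clause */

    /* Only the input arrays are placed on the device. */
    #pragma acc data copyin(re[0:n], im[0:n])
    {
        /* The reduction combines the partial sums into total. */
        #pragma acc parallel loop reduction(+:total)
        for (int i = 0; i < n; ++i) {
            total += re[i] + im[i];
        }
    }
    return total;
}
```

The same idea applies to a_blas: since cuBLAS writes it through a host pointer, it needs no data clause at all.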