Error using cuBLAS with OpenACC

Hello Mat, I would like to compare the speed of an OpenACC reduction with the speed of the cuBLAS sum routine, but I get an error. I believe I am using the cublasScasum function incorrectly, but I have checked the relevant documentation and cannot tell where the mistake is. Could you please point out the problem?

   int n;      /* size of the vector */
    int i;
    timestruct t1, t2, t3;
    long long t_acc, t_blas;
    if( argc > 1 )
	n = atoi( argv[1] );
    else
	n = 100000;
    if( n <= 0 ) n = 100000;


	float a_acc={0} ;
	float a_blas={0};
	cuComplex* e_count = (cuComplex*)malloc(n*sizeof(cuComplex));
	const int incx=1;  
    for( i = 0; i < n; ++i )
    {
     e_count[i].x = 1.0;
     e_count[i].y = 1.0;    
    }
 
    #pragma acc  data  copyin(e_count[0:n],n,a_acc,a_blas)
   {
    gettime( &t1 );
    #pragma acc parallel reduction(+:a_acc)
    for( i = 0; i < n; ++i ){
 	a_acc=a_acc+e_count[i].x+e_count[i].y;
    }
    gettime( &t2 );

   
    #pragma acc host_data use_device(e_count,a_blas)
    {
	cublasHandle_t handle;
	cublasCreate(&handle); 
  	cublasScasum(handle,n,e_count,incx,&a_blas);
    }	
    gettime( &t3 );
    }
     t_acc = usec(t1,t2);
     t_blas = usec(t2,t3);
    
//    if(a_acc!=a_blas) printf( "Test FAILED\n");

    /* check the results */

    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on OpenACC\n",t_acc );
    printf( "%13ld microseconds on Cublas\n", t_blas);

What’s the error you’re seeing? Which cuBLAS header file are you using?

There are two different header files for cuBLAS, and your call uses the “cublas_v2.h” interface. If you’re including “cublas.h” instead, you’ll get an error saying that there are too many arguments.

Your code will also segfault because the device address of “a_blas” ends up being dereferenced on the host. I’d remove “a_blas” from the host_data directive and pass in the host pointer.

Since you didn’t include your header files, I wasn’t sure which timer you’re using, so I took the timing calls out. Here’s what seems to work for me:

% cat test.c

#include <stdio.h>
#include <stdlib.h>
#include <cublas_v2.h>

int main(int argc, char * argv[]) {

    int n;      /* size of the vector */
    int i;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;


        float a_acc={0} ;
        float a_blas={0};
        cuComplex* e_count = (cuComplex*)malloc(n*sizeof(cuComplex));
        const int incx=1;
    for( i = 0; i < n; ++i )
    {
     e_count[i].x = 1.0;
     e_count[i].y = 1.0;
    }

    #pragma acc  data  copyin(e_count[0:n],n)
   {
    #pragma acc parallel reduction(+:a_acc)
    for( i = 0; i < n; ++i ){
        a_acc=a_acc+e_count[i].x+e_count[i].y;
    }

    #pragma acc host_data use_device(e_count)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasScasum(handle,n,e_count,incx,&a_blas);
    }
    }

    /* check the results */
    if(a_acc!=a_blas) printf( "Test FAILED\n");
    printf( "%13d iterations completed\n", n );

}
% nvc -cuda -cudalib=cublas -acc test.c -w -fast ; a.out
       100000 iterations completed

Hope this helps,
Mat

I found the problem: I had included the wrong header file. I was using

#include "cublasXt.h"

I have another question: in my code, how can a_blas and a_acc be copied back from the GPU to the CPU? I tried the following:

    #pragma acc  data  copyin(e_count[0:n],n) copyout(a_blas,a_acc)
   {
    #pragma acc parallel reduction(+:a_acc)
    for( i = 0; i < n; ++i ){
        a_acc=a_acc+e_count[i].x+e_count[i].y;
    }
    #pragma acc host_data use_device(e_count)
    {
        cublasScasum(handle,n,e_count,incx,&a_blas);
    }
    }
    printf( "a_blas=%13ld \n", a_blas);
    printf( "a_acc=%13ld \n", a_acc);
    /* check the results */
    if(a_acc!=a_blas) printf( "Test FAILED\n");

The output is as follows:

a_blas=281473575516772 
a_acc=            0 
Test FAILED

Why is a_acc not copied back from the GPU to the CPU?

Your format is incorrect given that these variables are floats, not long integers. Using “f” instead of “ld” will fix the “a_acc” issue.
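As a small illustration of the format fix, here is a sketch in plain C. The helper name format_sum is hypothetical and exists only so the formatting can be checked in isolation; the point is just that a float goes with “%f”, not “%ld”.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the original program): formats
   "<name>=<value>" with the value in a 13-character field, using the
   %f conversion that matches a float argument. */
int format_sum(char *buf, size_t len, const char *name, float value)
{
    /* A float passed through "..." is promoted to double, which is
       exactly what %f expects. %ld expects a long, so pairing it with
       a float is undefined behavior and prints garbage. */
    return snprintf(buf, len, "%s=%13f", name, (double)value);
}
```

In the program above, the equivalent change would be printf( "a_acc=%13f \n", a_acc); and the same for a_blas.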

Your “a_blas” will be incorrect since you’re overwriting the host value with the device value, which is uninitialized. You should remove “a_blas” from the copyout clause.

To illustrate:

This creates an uninitialized “a_blas” on the device

#pragma acc data copyin(e_count[0:n],n) copyout(a_acc,a_blas)

Here the cuBLAS routine writes its result to the host copy of “a_blas”

cublasScasum(handle,n,e_count,incx,&a_blas);

At the end of the data region, the device copy of “a_blas” is copied back to the host, overwriting the result from cuBLAS.

} // end of data region
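
Putting the points above together, here is a minimal sketch of the data region with “a_blas” kept out of every data clause. It makes two assumptions so that it builds without the CUDA toolkit: cfloat is a hypothetical stand-in for cuComplex, and scasum_host is a hypothetical host-side reference for the value cublasScasum produces (the sum of |real| + |imag|). In the real program, the marked line would be the cublasScasum call inside a host_data use_device(e_count) region. A compiler without OpenACC support simply ignores the pragmas.

```c
#include <stdlib.h>

/* Hypothetical stand-in for cuComplex (assumption, see text above). */
typedef struct { float x, y; } cfloat;

/* Absolute value helper, to avoid depending on the math library. */
float absf(float v) { return v < 0.0f ? -v : v; }

/* Hypothetical host reference for what cublasScasum computes:
   the sum of |real| + |imag| over all n elements. */
float scasum_host(int n, const cfloat *v)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += absf(v[i].x) + absf(v[i].y);
    return s;
}

/* Fills a vector of n elements with (1, 1) and computes both sums.
   Returns 0 on success, -1 if the allocation fails. */
int compare_sums(int n, float *a_acc_out, float *a_blas_out)
{
    cfloat *e_count = malloc((size_t)n * sizeof *e_count);
    if (e_count == NULL)
        return -1;
    for (int i = 0; i < n; ++i) {
        e_count[i].x = 1.0f;
        e_count[i].y = 1.0f;
    }

    float a_acc = 0.0f;
    float a_blas = 0.0f;  /* host only: never named in a data clause */

    #pragma acc data copyin(e_count[0:n])
    {
        /* The reduction result ends up in the host a_acc. */
        #pragma acc parallel loop reduction(+:a_acc)
        for (int i = 0; i < n; ++i)
            a_acc += e_count[i].x + e_count[i].y;

        /* Real program: inside host_data use_device(e_count), call
           cublasScasum(handle, n, e_count, incx, &a_blas);
           the result lands in the host a_blas. */
        a_blas = scasum_host(n, e_count);
    }   /* nothing overwrites a_blas here, since it has no device copy */

    free(e_count);
    *a_acc_out = a_acc;
    *a_blas_out = a_blas;
    return 0;
}
```

With n = 100000 both sums come out to 200000, which a float still represents exactly, so the equality check in the test program holds.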

Thank you very much, mat
