Hi,
I have a GeForce GTX 1060, and this is my matrix-matrix multiplication code.
// Matrices are n*n, row-major, in flat arrays.
// C is output-only, so it only needs to be created on the device, not copied in.
#pragma acc enter data copyin(A[0:n*n], B[0:n*n]) create(C[0:n*n])
#pragma acc kernels present(A[0:n*n], B[0:n*n], C[0:n*n])
#pragma acc loop independent tile(16,64)
for(int i = 0; i < n; ++i){
    for(int j = 0; j < n; ++j){
        double tmp = 0;
        #pragma acc loop reduction(+:tmp)
        for(int k = 0; k < n; ++k){
            tmp += A[n*i+k] * B[n*k+j];
        }
        C[n*i+j] = tmp;
    }
}
#pragma acc exit data copyout(C[0:n*n]) delete(A[0:n*n], B[0:n*n])
These are my compile options:
pgc++ -acc -fast -O3
In double precision this code runs about as fast as cuBLAS on the GTX 1060, but in single precision it is more than ten times slower.
How can I improve performance in single precision?
Thank you.