Improving the performance of matrix-matrix multiplication

Hi,
I have a GeForce GTX 1060, and this is my matrix-matrix multiplication code.

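// A, B, C are row-major n x n matrices; N is assumed to be the element count (n*n).
// C is only written on the device, so it is created there rather than copied in.
#pragma acc enter data copyin(A[0:N], B[0:N]) create(C[0:N])
#pragma acc kernels present(A[0:N], B[0:N], C[0:N])
#pragma acc loop independent tile(16,64)
for(int i = 0; i < n; ++i){
    for(int j = 0; j < n; ++j){
        double tmp = 0;
        // Dot product of row i of A and column j of B.
        #pragma acc loop reduction(+:tmp)
        for(int k = 0; k < n; ++k){
            tmp += A[n*i+k] * B[n*k+j];
        }
        C[n*i+j] = tmp;
    }
}
#pragma acc exit data copyout(C[0:N]) delete(A[0:N], B[0:N])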

These are my compile options:

pgc++ -acc -fast -O3

In double precision, this code runs as fast as cuBLAS on the GTX 1060.
In single precision (float), however, it is more than ten times slower than cuBLAS.
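
To make the comparison concrete, here is a minimal sketch of how the same loop nest can serve both precisions. It assumes the single-precision run differs from the double-precision one only in the element type, and that N is n*n. The function name matmul and the structured data region are illustrative only, not my actual program. A compiler without OpenACC support ignores the pragmas, so the sketch can also be checked on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: C = A * B for row-major n x n
// matrices. T is float or double, so both cases share one loop nest.
template <typename T>
void matmul(const std::vector<T>& a, const std::vector<T>& b,
            std::vector<T>& c, int n) {
    const int N = n * n;  // assumption: N is the element count per matrix
    assert(a.size() == static_cast<std::size_t>(N));
    assert(b.size() == static_cast<std::size_t>(N));
    assert(c.size() == static_cast<std::size_t>(N));

    const T* A = a.data();
    const T* B = b.data();
    T* C = c.data();

    // Structured data region: A and B are copied in, C is copied back out.
    #pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N])
    {
        #pragma acc kernels present(A[0:N], B[0:N], C[0:N])
        #pragma acc loop independent tile(16,64)
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                T tmp = 0;  // accumulator has the same precision as the data
                #pragma acc loop reduction(+:tmp)
                for (int k = 0; k < n; ++k) {
                    tmp += A[n*i+k] * B[n*k+j];
                }
                C[n*i+j] = tmp;
            }
        }
    }
}
```

Calling matmul with std::vector<float> arguments gives the slow single-precision case, and calling it with std::vector<double> gives the case that matches cuBLAS.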

How can I improve the performance in single precision?

Thank you.