Hi,
I have run a matrixMul test using cublasSgemm and cublasDgemm on TK1 (Unified Memory).
C = A^T * A, where A is 3200 x 320 and C is 320 x 320, both stored column-major.
The Unified Memory buffers are allocated before the function is called.
1) cublasSgemm: 8.1489 ms
2) cublasDgemm: 56.7583 ms
First question: is the elapsed time correct?
Second question: why is double precision ~7x slower than single precision?
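One way to sanity-check these timings is to convert them to achieved throughput, which can then be compared against the device's peak rate for each precision. The sketch below assumes the usual 2*M*N*K FLOP count for a GEMM; the helper names `gemm_flops`, `gflops` and `report` are made up for illustration and are not part of cuBLAS.

```cpp
#include <cstdio>

// FLOP count of a GEMM producing an m x n result with inner dimension k:
// every output element needs k multiplies and k adds.
double gemm_flops(double m, double n, double k) { return 2.0 * m * n * k; }

// Achieved throughput in GFLOP/s for a run that took `ms` milliseconds.
double gflops(double flops, double ms) { return flops / (ms * 1.0e-3) / 1.0e9; }

// Prints the throughput for the two timings quoted in this post
// (C is 320 x 320, inner dimension 3200).
void report() {
    const double flops = gemm_flops(320, 320, 3200);
    std::printf("SGEMM: %.1f GFLOP/s\n", gflops(flops, 8.1489));
    std::printf("DGEMM: %.1f GFLOP/s\n", gflops(flops, 56.7583));
    std::printf("ratio: %.2fx\n", 56.7583 / 8.1489);
}
```

With the numbers above this works out to roughly 80 GFLOP/s in single precision and about 11.5 GFLOP/s in double precision, a ratio just under 7x.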
The cublasDgemm function:
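// handle, monoA and monoC (Unified Memory) are set up before this function is called.
void cublas_Dgemm_unified(MatrixXd& matA, MatrixXd& matC, int K, int N)
{
    // copy the K x N input into the unified buffer
    memcpy(monoA, matA.data(), K*N*sizeof(double));
    // gemm: C (N x N) = A^T * A, with lda = ldb = K and ldc = N
    double alpha = 1.0;
    double beta = 0.0;
    cublas_safe_call( cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, N, K,
                                  &alpha, monoA, K, monoA, K, &beta, monoC, N) );
    // wait for the GPU to finish before the CPU touches monoC
    cudaDeviceSynchronize();
    // transfer the result back into the Eigen matrix
    memcpy(matC.data(), monoC, N*N*sizeof(double));
}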
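
To rule out a wrong result (for example from a mistaken leading dimension) before trusting the timings, the GPU output could be compared against a plain CPU reference. This is only a sketch: `reference_ata` is a hypothetical helper, it uses `std::vector` instead of the Eigen and unified buffers above, and it assumes the same column-major layout with A being K x N.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for C = A^T * A with column-major storage.
// A is K x N (element (k, j) lives at A[k + j*K]); the result is N x N.
std::vector<double> reference_ata(const std::vector<double>& A, int K, int N) {
    std::vector<double> C(static_cast<std::size_t>(N) * N, 0.0);
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            double sum = 0.0;
            // dot product of column i and column j of A
            for (int k = 0; k < K; ++k) {
                sum += A[k + static_cast<std::size_t>(i) * K] *
                       A[k + static_cast<std::size_t>(j) * K];
            }
            C[i + static_cast<std::size_t>(j) * N] = sum;
        }
    }
    return C;
}
```

Comparing each entry of monoC with this reference, within a small tolerance, would confirm that the cublasDgemm arguments describe the intended product.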