cublasDgemm performance on TK1

Hi,

I ran a matrix multiplication test using cublasSgemm and cublasDgemm on a TK1 (Unified Memory).
The test computes C = A^T * A, where matrix A is 3200 x 320 and matrix C is 320 x 320, both column-major.

The Unified Memory buffers are allocated before the function is called.

1) cublasSgemm: 8.1489 ms
2) cublasDgemm: 56.7583 ms
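
To put the two timings in perspective, here is a minimal sketch that converts them to throughput. It assumes the usual operation count of 2*N*N*K for a GEMM of this shape (one multiply and one add per term); the helper name gemm_gflops is made up for illustration and is not part of cuBLAS.

```cpp
#include <cassert>

// Hypothetical helper (not a cuBLAS function): throughput in GFLOP/s for
// C = A^T * A with A of size K x N, assuming 2*N*N*K floating-point operations.
double gemm_gflops(int n, int k, double elapsed_ms)
{
	assert(elapsed_ms > 0.0);
	const double flops = 2.0 * n * n * k;          // one multiply + one add per term
	return flops / (elapsed_ms * 1.0e-3) / 1.0e9;  // ms -> s, then FLOP/s -> GFLOP/s
}
```

With N = 320 and K = 3200, gemm_gflops(320, 3200, 8.1489) comes out at roughly 80.4 GFLOP/s for cublasSgemm and gemm_gflops(320, 3200, 56.7583) at roughly 11.5 GFLOP/s for cublasDgemm, a ratio of about 6.97.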

First question: is the elapsed time correct?
Second question: why is double precision ~7x slower than single precision?

The cublasDgemm function:

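// handle, monoA and monoC are set up before this function is called;
// monoA and monoC are the Unified Memory buffers.
void cublas_Dgemm_unified(MatrixXd& matA, MatrixXd& matC, int K, int N)
{
	// copy the K x N input matrix (column-major) into unified memory
	memcpy(monoA, matA.data(), K*N*sizeof(double));
	
	// gemm: C (N x N) = alpha * A^T (N x K) * A (K x N) + beta * C
	double alpha = 1.0;
	double beta = 0.0;
	cublas_safe_call( cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, N, K, 
		&alpha, monoA, K, monoA, K, &beta, monoC, N) );
	// wait for the GPU to finish before the host reads monoC
	cudaDeviceSynchronize();
	
	// transfer the result back to the output matrix
	memcpy(matC.data(), monoC, N*N*sizeof(double));
}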

Thanks!

Why is the double-precision version ~7x slower than the single-precision one?
That seems too slow!