cublasDgemm performance on TK1


I have run a matrixMul test using cublasSgemm and cublasDgemm on TK1 (Unified Memory).
C = A^T * A, where A is 3200 x 320 and C is 320 x 320, both stored column major.

Unified Memory is allocated before the function is called.

1) cublasSgemm: 8.1489 ms
2) cublasDgemm: 56.7583 ms

First question: are these elapsed times correct?
Second question: why is double precision ~7x slower than single precision?
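
As a rough sanity check on the timings, a GEMM of this shape costs about 2*N*N*K floating-point operations, so each elapsed time can be converted into an achieved GFLOP/s figure. The sketch below does only that arithmetic; the helper names gemm_flops and gemm_gflops are made up for illustration and are not part of cuBLAS.

```cpp
#include <cassert>

// Floating-point operations for C (m x n) = op(A) (m x k) * op(B) (k x n):
// one multiply and one add per inner-product term.
// Hypothetical helper, not a cuBLAS function.
double gemm_flops(int m, int n, int k) {
    return 2.0 * m * n * k;
}

// Achieved throughput in GFLOP/s for a run that took elapsed_ms milliseconds.
// Hypothetical helper, not a cuBLAS function.
double gemm_gflops(int m, int n, int k, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e9;
}
```

With N = 320 and K = 3200, gemm_gflops(320, 320, 3200, 8.1489) comes out at about 80 GFLOP/s for cublasSgemm and gemm_gflops(320, 320, 3200, 56.7583) at about 11.5 GFLOP/s for cublasDgemm. Comparing those numbers against the board's peak single and double precision rates should show whether either timing is off.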

cublasDgemm function

void cublas_Dgemm_unified(MatrixXd& matA, MatrixXd& matC, int K, int N)
{
	memcpy(monoA,, K*N*sizeof(double));
	// gemm: C (N x N) = A^T * A, with A stored column major as K x N
	double alpha = 1.0;
	double beta = 0.0;
	cublas_safe_call( cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, N, K,
		&alpha, monoA, K, monoA, K, &beta, monoC, N) );
	// transfer
	memcpy(, monoC, N*N*sizeof(double));
}
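
To check that the call arguments (CUBLAS_OP_T, CUBLAS_OP_N, m = N, n = N, k = K, lda = ldb = K, ldc = N) really compute C = A^T * A, the result can be compared against a plain CPU loop using the same column-major layout. This is only a minimal sketch; gram_reference is a made-up name, and it uses std::vector instead of the buffers in the function above.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for C = A^T * A (hypothetical helper for verification only).
// A is K x N, column major, leading dimension K: element (k, i) is A[k + i*K].
// C is N x N, column major, leading dimension N: element (i, j) is C[i + j*N].
std::vector<double> gram_reference(const std::vector<double>& A,
                                   std::size_t K, std::size_t N) {
    std::vector<double> C(N * N, 0.0);
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < N; ++i) {
            double sum = 0.0;
            // Dot product of column i and column j of A.
            for (std::size_t k = 0; k < K; ++k) {
                sum += A[k + i * K] * A[k + j * K];
            }
            C[i + j * N] = sum;
        }
    }
    return C;
}
```

Running this on a small A and comparing element by element (with a tolerance) against the cuBLAS output would confirm the leading dimensions before looking further at the timings.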


Why is the double precision version ~7x slower than the single precision one?
That’s too slow!