CUBLAS SGEMM performance

What is the typical sgemm performance on various cards?

On my 8600GT OC, I am getting 23 GFLOPS, which seems a bit low.

On a high-end card (Tesla C870, Quadro FX 5600, GeForce 8800 GTX), I have measured ~120 GFLOPS on the card itself.
If you include the I/O overhead, you can get ~100 GFLOPS with pinned memory.

Right, so that should come out to ~30 GFLOPS for me. In fact, I remember getting that before, but I can't reproduce it.
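The ~30 GFLOPS figure is consistent with scaling the ~120 GFLOPS measurement by the number of stream processors. A rough sketch of that arithmetic follows. It assumes 128 stream processors on the 8800 GTX and 32 on the 8600 GT, and it assumes SGEMM throughput scales linearly with that count. Neither assumption comes from this thread. The function name `scaled_gflops` and the `clock_ratio` parameter are made up for illustration.

```c
#include <stdio.h>

/* Back-of-the-envelope estimate, not a measurement.
   ref_gflops  : rate measured on the reference card
   ref_sps     : stream processors on the reference card (assumed value)
   sps         : stream processors on the target card (assumed value)
   clock_ratio : target shader clock / reference shader clock (1.0 = same) */
double scaled_gflops(double ref_gflops, int ref_sps, int sps, double clock_ratio)
{
    return ref_gflops * ((double)sps / (double)ref_sps) * clock_ratio;
}
```

Calling `scaled_gflops(120.0, 128, 32, 1.0)` gives 30.0. A lower shader clock on the smaller card would reduce the estimate further, so a measured 23 GFLOPS may be less far off than it first looks.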

I modified simpleCUBLAS to measure performance. Here is my code:

	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&start_time);

	/* Performs operation using cublas */
	cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

	/* Read back a single element (instead of all n2) to force a sync */
	cublasGetVector(1/*n2*/, sizeof(h_C[0]), d_C, 1, h_C, 1);

	QueryPerformanceCounter(&end_time);

	/* Elapsed seconds = counter ticks / ticks per second */
	duration = ((double) end_time.QuadPart - (double)start_time.QuadPart) / (double)freq.QuadPart;

	fprintf (stderr, "Kernel time: %1.3fs\n", duration);
	/* An N x N SGEMM performs 2*N^3 floating-point operations */
	fprintf (stderr, "Performance: %fGFLOPS\n", 2.0*N*N*N / 1e9 / duration);

btw, what’s a better way to sync than to read back some data? (And why doesn’t cublasGetError cause a sync?)

cudaThreadSynchronize();

No, this is CUBLAS. There is no cudaThreadSynchronize.

You can still use it: CUBLAS runs on top of the CUDA runtime, so the runtime call works alongside it.
You could also enable profiling; CUDA will then fall back to blocking mode.
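The advice in this thread boils down to a timing pattern: make sure nothing is pending, start the clock, launch the asynchronous call, block until the GPU is done, and only then stop the clock. Here is a minimal sketch of that pattern in plain C, so it compiles without a GPU. The names `step_fn`, `time_async`, `count_call` and `sgemm_gflops` are hypothetical. In a real program, `launch` would wrap the `cublasSgemm` call and `sync` would wrap `cudaThreadSynchronize()`. The warm-up call and averaging over several iterations are additions not discussed above, and `timespec_get` is the C11 wall clock, used here in place of `QueryPerformanceCounter`.

```c
#include <stdio.h>
#include <time.h>

typedef void (*step_fn)(void *ctx);

/* Wall-clock seconds via C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average seconds per launch. iters must be > 0. */
double time_async(step_fn launch, step_fn sync, void *ctx, int iters)
{
    launch(ctx);                 /* warm-up, keeps one-time setup out of the timing */
    sync(ctx);                   /* nothing pending when the clock starts */

    double t0 = now_seconds();
    for (int i = 0; i < iters; ++i)
        launch(ctx);             /* on a GPU: cublasSgemm(...), returns before the work is done */
    sync(ctx);                   /* on a GPU: cudaThreadSynchronize(), blocks until finished */
    double t1 = now_seconds();

    return (t1 - t0) / iters;
}

/* An N x N SGEMM performs 2*N^3 floating-point operations. */
double sgemm_gflops(int n, double seconds)
{
    return 2.0 * n * n * n / 1e9 / seconds;
}

/* Stand-in for a GPU call so the sketch runs anywhere: counts invocations. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

With real wrappers in place of `count_call`, something like `sgemm_gflops(N, time_async(run_sgemm, wait_gpu, &args, 10))` would report the rate, where `run_sgemm`, `wait_gpu` and `args` are again placeholder names. Timing this way avoids including a device-to-host copy in the measurement.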