What is the typical sgemm performance on various cards?
On my 8600GT OC, I am getting 23 GFLOPs, which seems a bit low.
On a high-end card (Tesla C870, Quadro FX 5600, GeForce 8800 GTX) I have measured ~120 GFLOPS on the card.
If you include the I/O overhead, you can get ~100 GFLOPS with pinned memory.
Right, so that should come out to ~30 Gflops for me. In fact, I remember getting that before but I can’t reproduce it.
I modified simpleCUBLAS to measure performance. Here is my code:
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start_time);
/* Performs operation using cublas */
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
cublasGetVector(1/*n2*/, sizeof(h_C[0]), d_C, 1, h_C, 1);
QueryPerformanceCounter(&end_time);
duration = ((double) end_time.QuadPart - (double)start_time.QuadPart) / (double)freq.QuadPart;
fprintf (stderr, "Kernel time: %1.3fs\n", duration);
fprintf (stderr, "Performance: %fGFLOPS\n", 2.0*N*N*N / 1e9 / duration);
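The performance line above counts 2*N*N*N floating-point operations for an N x N SGEMM (one multiply and one add per inner-product term) and divides by the elapsed time. As a minimal sketch, that arithmetic can be pulled into a helper; the name sgemm_gflops is hypothetical and not part of CUBLAS or simpleCUBLAS:

```c
/* Hypothetical helper (not a CUBLAS function): converts a measured SGEMM
   time into GFLOPS, assuming square N x N matrices as in the snippet above. */
double sgemm_gflops(int n, double seconds)
{
    /* N*N*N multiply-adds, counted as 2 floating-point operations each. */
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / 1e9 / seconds;
}
```

For example, with an assumed N of 1024, a run at 23 GFLOPS would take roughly 0.09 s, so a single call is short enough that timer resolution and launch overhead can noticeably affect the result.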
btw, what’s a better way to sync than to read back some data? (And why doesn’t cublasGetError cause a sync?)
cudaThreadSynchronize();
No, this is CUBLAS. There is no cudaThreadSynchronize.
You can still use it.
You could also enable profiling; CUDA will then fall back to the blocking mode.
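The advice above amounts to this: the CUBLAS call returns before the GPU has finished, so the host timer should only be read after an explicit synchronization. Here is a rough sketch of that timing pattern, written against plain callbacks so it compiles without a GPU. The names time_async, demo_launch and demo_sync are hypothetical; in a real test, launch would wrap the cublasSgemm call and sync would wrap cudaThreadSynchronize() (named cudaDeviceSynchronize() in later CUDA releases). The warm-up call and the averaging over several iterations are additions of this sketch, not something suggested in the thread.

```c
#include <time.h>

typedef void (*step_fn)(void);

/* Stand-ins so the sketch builds without CUDA. Replace with wrappers around
   cublasSgemm(...) and cudaThreadSynchronize() in a real measurement. */
static int launches = 0, syncs = 0;
static void demo_launch(void) { ++launches; }
static void demo_sync(void)   { ++syncs; }

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the average seconds per launch over `iters` launches. */
double time_async(step_fn launch, step_fn sync, int iters)
{
    launch();              /* warm-up: keeps one-time setup out of the timing */
    sync();                /* make sure the warm-up has finished */

    double start = now_seconds();
    for (int i = 0; i < iters; ++i)
        launch();          /* asynchronous launches queue up on the device */
    sync();                /* block until all queued work is done */
    double stop = now_seconds();

    return (stop - start) / iters;
}
```

Calling time_async(demo_launch, demo_sync, 10) exercises the harness with the stand-ins. With the real wrappers, the returned time covers only the kernel work, without the small read-back used as a sync point in the original snippet.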