Matrix Multiplication Throughput


I’m new to CUDA programming. I’m running the matrix multiplication example from the CUDA SDK on a Tesla C1060, and I cannot achieve more than 2.5 Gflop/s. Is that normal? I don’t understand why.

Thanks in advance.


The SDK examples are not well optimized; their intention is mainly to get you started with CUDA at all (I know, the makefiles seem to contradict that).
See this thread or this thread for really well optimized matrix multiply routines.

I get ~200 Gflop/s with the SDK example on a GTX 280.

You are seeing low performance for two reasons: (i) the runtime reported by the example includes the PCIe transfer time; (ii) the matrix sizes used in the example are very small.
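To see what the kernel alone achieves, you can time just the kernel launch with CUDA events, keeping the host-to-device copies outside the timed region. Below is a minimal sketch (not the SDK code itself): a naive matrix-multiply kernel with assumed names like matMul, dA, dB, dC, timed so that PCIe transfers are excluded.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024   /* ~1000x1000, as suggested in this thread */

__global__ void matMul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

int main()
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 1.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    /* PCIe transfers happen OUTSIDE the timed region */
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    matMul<<<grid, block>>>(dA, dB, dC, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      /* wait until the kernel has finished */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* an N x N multiply does 2*N^3 floating-point operations */
    double gflops = 2.0 * N * N * N / (ms * 1.0e6);
    printf("kernel time: %.3f ms, %.1f Gflop/s\n", ms, gflops);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB);
    return 0;
}
```

Timing this way measures only the device-side work, which is the number worth comparing against the card's peak throughput.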

If you use larger matrices, say ~1000x1000, and exclude the PCIe transfers from the timing, you’ll see that the example code is only about 2x slower than CUBLAS on a GTX 280.
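For the CUBLAS comparison, the single-precision multiply is one library call. A sketch, assuming dA, dB, dC are hypothetical device pointers to column-major N x N matrices already copied to the GPU, using the classic CUBLAS API of that era:

```cuda
#include <cublas.h>
#include <cuda_runtime.h>

/* C = 1.0 * A * B + 0.0 * C, no transposes, column-major N x N matrices */
void sgemm_gpu(const float *dA, const float *dB, float *dC, int n)
{
    cublasSgemm('n', 'n', n, n, n,
                1.0f, dA, n,
                dB, n,
                0.0f, dC, n);
    cudaThreadSynchronize();   /* the call is asynchronous; wait before stopping the timer */
}
```

Note the explicit synchronization: like kernel launches, cublasSgemm returns before the GPU finishes, so timing without it would measure almost nothing.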