I’m quite new to CUDA programming. I took the CUDA matrix multiplication example (the version without shared memory) from the Programming Guide; the result is correct, but it is too slow. I multiply two 1024 x 1024 matrices using 16 x 16 blocks and get an execution time of 5.39 s (in both Debug and Release mode), whereas plain C takes 6.64 s in Debug mode and 4.09 s in Release mode. I use Visual Studio 2005. So in Release mode CUDA seems no better than C, and I think I must have done something wrong somewhere.
Could you tell me what I did wrong, please?
Thank you for your help.
Try CUBLAS. The SDK code is not very efficient, and “(without using shared memory)” is not a good idea. You also do not say what kind of card you have. A GTX280 will perform at about 380 GFlops in single precision (SGEMM), while a 4-core Xeon running optimized code on all cores will reach about 80 GFlops.
Something else has got to be wrong here. The SDK matrix multiply example is about one third the speed of CUBLAS (I was timing this recently, for my own nefarious purposes), and copying 1024^2 matrices to the GPU and back isn’t that slow. What do these times include? Was the CUDA context already established before the timer began? Those times look suspiciously like ‘whole program’ times.
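As a rough sanity check on the timings quoted in this thread, a run time can be converted into throughput. The sketch below assumes the conventional count of 2*n^3 floating-point operations for an n x n matrix product (one multiply and one add per inner-loop step). The function name `gemm_gflops` is made up for illustration and does not come from the SDK or the Programming Guide.

```c
#include <assert.h>

/* Hypothetical helper: throughput in GFLOP/s of an n x n by n x n matrix
   multiply that took `seconds`, assuming 2*n^3 floating-point operations. */
double gemm_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

Under that assumption, `gemm_gflops(1024, 5.39)` comes out at roughly 0.40 and `gemm_gflops(1024, 4.09)` at roughly 0.53, so both runs are well under 1 GFLOP/s. That would be consistent with the suspicion that the 5.39 s figure covers more than the multiplication itself.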
I took the CUDA matrix multiplication example using shared memory from the Programming Guide. I multiply two 1024 x 1024 matrices using 16 x 16 blocks. I use 8 registers per thread. My GPU is an 8400 GS with 8 stream processors (1400 MHz).
I get an execution time (for the kernel alone) of 387ms.
Could you please tell me whether this is a slow or a normal execution time?
Thank you for your help. :)
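One rough way to judge the 387 ms figure is to compare the achieved throughput with an estimated peak. The sketch below makes two assumptions that are not stated in the post: the multiply costs 2*n^3 floating-point operations, and each stream processor can complete one multiply-add (2 floating-point operations) per clock. Both function names are hypothetical, and the peak is only an upper bound that real kernels do not reach.

```c
#include <assert.h>

/* Hypothetical helper: achieved GFLOP/s for an n x n matrix multiply,
   assuming 2*n^3 floating-point operations in total. */
double achieved_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    return 2.0 * (double)n * (double)n * (double)n / seconds / 1e9;
}

/* Hypothetical helper: estimated peak GFLOP/s for `processors` stream
   processors at `clock_mhz`, each doing `flops_per_clock` operations per
   cycle (2.0 is an assumption: one multiply-add per clock). */
double peak_gflops(int processors, double clock_mhz, double flops_per_clock)
{
    assert(processors > 0 && clock_mhz > 0.0 && flops_per_clock > 0.0);
    return (double)processors * clock_mhz * 1e6 * flops_per_clock / 1e9;
}
```

With the numbers from the post, `achieved_gflops(1024, 0.387)` is roughly 5.5 and `peak_gflops(8, 1400.0, 2.0)` is 22.4, so under these assumptions the kernel reaches roughly a quarter of the estimated peak.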