I wrote a matrix multiplication ("matrix dot product") kernel in CUDA that uses 16x16 thread blocks and tiles the inputs through shared memory.
I ran the kernel on my card (a GeForce 9300 GE) and compared it against a CPU implementation (just three nested for loops, on a 2.66 GHz Intel Core 2 Quad).
For multiplying two 800x800 matrices, the CPU took 6338.66 ms and the GPU took 191.34 ms, making the GPU about 33 times faster in that case (and I'm including the host-device memory transfers in the GPU time). That seems almost too fast, considering the GeForce 9300 GE only has 8 SPs.
Does anyone have a theory on how this much speedup is possible, or is my timing wrong?