GPUs, How do they work? Suspiciously fast matrix dot product execution

I wrote a matrix dot product (matrix multiplication) kernel in CUDA that uses 16x16 thread blocks and stages its operands in shared memory.

I ran the kernel on my card (GeForce 9300 GE) and compared it against a CPU implementation (just three nested for loops on a 2.66 GHz Intel Core 2 Quad).

For a dot product operation between two 800x800 matrices, the CPU took 6338.66 milliseconds and the GPU took 191.34 milliseconds. The GPU is about 33 times faster in that case (I’m including the time it takes for the memory transfers to take place). That seems almost too fast considering the GeForce 9300 GE only has 8 SPs.

Does anyone have any theories on how this much speedup is possible, or is my timing wrong?

  1. Did you verify the GPU results against the CPU results?

  2. What is your CPU code? A tiled approach or the naive three-for-loop version?

    If you use the naive implementation, you will wait forever when the dimension is large.
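
To put the reported timings in perspective, it can help to turn them into throughput. The sketch below assumes the usual operation count of 2*n^3 floating-point operations for an n x n matrix product (one multiply and one add per inner-loop step); the function name `gflops` is made up for illustration.

```c
#include <assert.h>

/* Approximate throughput in GFLOP/s for an n x n matrix product that
   took `milliseconds` to run, assuming 2*n^3 floating-point operations. */
double gflops(double n, double milliseconds)
{
    assert(n > 0.0 && milliseconds > 0.0);
    double flops = 2.0 * n * n * n;        /* one multiply + one add per step */
    double seconds = milliseconds / 1000.0;
    return flops / seconds / 1e9;
}
```

With the numbers from the first post, `gflops(800, 6338.66)` comes out at roughly 0.16 and `gflops(800, 191.34)` at roughly 5.35. On a 2.66 GHz core, 0.16 GFLOP/s is far less than one floating-point operation per clock cycle, so the large ratio may say more about the CPU version being slow than about the GPU being suspiciously fast.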

The algorithm for the CPU is:

    For every cell in the result matrix ‘C’:

    Multiply each element of the corresponding row of the ‘A’ matrix by the matching element of the corresponding column of the ‘B’ matrix, and add the products together to get the value of that cell

Basically the CPU version goes to each cell, one cell at a time, and calculates the result for that cell before moving on to the next cell. Would a tiled approach on the CPU work faster?
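
For comparison, here is a rough sketch of both CPU variants in C. It is not the code from the post: it assumes square n x n matrices stored row-major in flat `float` arrays, and the names `matmul_naive`, `matmul_tiled`, and the tile size `TILE` (set to 16 only to mirror the GPU block size) are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Naive version: finish one result cell before moving on to the next. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j]; /* b is read with stride n */
            c[i * n + j] = sum;
        }
    }
}

#define TILE 16 /* assumed tile size; worth tuning per machine */

/* Tiled version: work on TILE x TILE sub-blocks so the data being
   reused stays in cache. Handles n that is not a multiple of TILE. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = ii + TILE < n ? ii + TILE : n;
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = kk + TILE < n ? kk + TILE : n;
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];
                        /* innermost loop walks b and c along a row */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

In the naive loop, every step of the inner loop jumps a whole row ahead in B, which tends to be hard on the cache once the matrices get large (an 800x800 `float` matrix is about 2.5 MB). The tiled version touches memory row by row inside small blocks, so it is often noticeably faster, though how much depends on the CPU, the compiler flags, and the tile size.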

Additionally, after the computation was complete, I compared the CPU and GPU results cell by cell, and every pair of corresponding cells agreed to within 0.01.
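
A check along those lines might look like the sketch below. The helper names `max_abs_diff` and `results_match` are hypothetical, and the tolerance is passed in by the caller (0.01 in the comparison described above).

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between corresponding elements. */
float max_abs_diff(const float *x, const float *y, size_t count)
{
    assert(count > 0);
    float worst = 0.0f;
    for (size_t i = 0; i < count; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Returns 1 if every pair of elements is within `tol`, otherwise 0. */
int results_match(const float *x, const float *y, size_t count, float tol)
{
    return max_abs_diff(x, y, count) <= tol;
}
```

For an 800x800 result this would be called as `results_match(c_cpu, c_gpu, 800 * 800, 0.01f)`, where `c_cpu` and `c_gpu` stand in for the two result buffers. An absolute tolerance like this is only meaningful relative to the size of the values in the matrices; if the entries are large, a relative tolerance may be the better test.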


I think I know the reason for this… You are a very good CUDA programmer. Welcome to the club!