I’m trying to get a good matrix-vector product implementation working on a GeForce 9300 GE for dense matrices stored as single-precision floating-point arrays in row-major order.
The equation I’m trying to solve is

c = A * b

where ‘A’ is an n×n matrix and ‘b’ and ‘c’ are size-n vectors; ‘A’ and ‘b’ are given, and ‘c’ must be found.
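For reference, the CPU baseline I’m comparing against is equivalent to this straightforward row-major loop (a sketch, not my exact code; the name `matvec` is just for illustration):

```c
#include <stddef.h>

/* Dense matrix-vector product c = A * b.
 * A is an n x n matrix stored row-major as a flat array;
 * b and c are vectors of length n. */
void matvec(const float *A, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * b[j];
        c[i] = sum;
    }
}
```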
So far, after about 8 major implementations (varying block size and thread count, among other things), none of my kernels has been faster than the CPU implementation, although a couple have come close to matching it.
This may be due to my hardware: the 9300 GE is a compute capability 1.1 device with a single SM (streaming multiprocessor) clocked at 540 MHz, whereas my CPU is a Core 2 Quad at 2.66 GHz.
My most successful implementations avoid warp divergence, use shared memory, make coalesced global memory accesses, and avoid shared memory bank conflicts.
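To make the approach concrete, my best kernels follow roughly this pattern (a simplified sketch, not my exact code; it assumes one thread block per row and that n is large enough to keep the threads busy):

```cuda
#define BLOCK 128  // threads per block; I've been varying this

// One block computes one row of c. Threads stride across the row so
// consecutive threads read consecutive elements of A (coalesced), then
// a tree reduction in shared memory combines the partial sums.
__global__ void matvec_kernel(const float *A, const float *b, float *c, int n)
{
    __shared__ float partial[BLOCK];
    int row = blockIdx.x;

    // Coalesced reads: thread t touches A[row*n + t], A[row*n + t + BLOCK], ...
    float sum = 0.0f;
    for (int j = threadIdx.x; j < n; j += BLOCK)
        sum += A[row * n + j] * b[j];

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Strided tree reduction: active threads stay packed in the low warps
    // (no divergence within a warp) and accesses are bank-conflict free.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        c[row] = partial[0];
}

// Launched as: matvec_kernel<<<n, BLOCK>>>(dA, dB, dC, n);
```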
Are there any other tricks I could use to improve? Is texture memory worth trying? It’s designed for matrices, but I’m already coalescing all my global memory accesses.