Dense matrix-vector product on a GeForce 9300 GE

I’m trying to get a good matrix-vector product implementation working on a 9300 GE for dense matrices stored as single-precision floating-point arrays in row-major form.

The equation I’m trying to solve is
c = A * b, where ‘A’ is an n×n matrix and ‘b’ and ‘c’ are size-n vectors; ‘A’ and ‘b’ are given, and ‘c’ must be found.

So far, after about 8 major implementations (varying block size and thread count, among other things), none of my methods has been faster than a CPU implementation, although a couple have come close to matching it.

This may be due to my hardware: the 9300 GE is a compute capability 1.1 device with a single SM and a core clock of 540 MHz, whereas my CPU is a Core 2 Quad at 2.66 GHz.

My most successful implementations avoid warp divergence, use shared memory, make coalesced global memory accesses, and avoid shared memory bank conflicts.
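
For concreteness, here’s a minimal sketch of the general shape of my kernels (not my exact code; the one-block-per-row mapping and the BLOCK size are illustrative choices):

```cuda
// Minimal sketch: one thread block per row of A (row-major), coalesced
// reads along the row, and a shared-memory tree reduction for the dot
// product. BLOCK must be a power of two.
#define BLOCK 128

__global__ void sgemv_row(const float *A, const float *b, float *c, int n)
{
    __shared__ float partial[BLOCK];

    int row = blockIdx.x;   // each block handles one row of A
    float sum = 0.0f;

    // Consecutive threads read consecutive addresses of A and b, so the
    // global loads coalesce. (On compute capability 1.1, perfect
    // coalescing also wants the row stride n to be a multiple of 16.)
    for (int j = threadIdx.x; j < n; j += BLOCK)
        sum += A[row * n + j] * b[j];

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory. Halving the stride keeps the
    // active threads contiguous, which avoids divergence within full
    // warps and avoids bank conflicts.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        c[row] = partial[0];
}

// Launched with one block per row:
//   sgemv_row<<<n, BLOCK>>>(d_A, d_b, d_c, n);
```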

Are there any other tricks I could use to improve? Is utilizing texture memory worth it? The texture cache is designed for the 2D locality of matrices, but I’m already coalescing all of my global memory accesses.
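
For reference, the texture path I’d be trying is the legacy texture-reference API available on compute capability 1.x toolkits; this is only a fragment showing the pieces that would change (texB and d_b are placeholder names):

```cuda
// Legacy texture reference for the input vector b; it must be declared
// at file scope with this API.
texture<float, 1, cudaReadModeElementType> texB;

// Host side: bind the device copy of b to the texture before the launch
// (and cudaUnbindTexture(texB) afterwards).
cudaBindTexture(NULL, texB, d_b, n * sizeof(float));

// Device side: inside the kernel, replace the plain global load of b
// with a cached texture fetch.
float bj = tex1Dfetch(texB, j);
```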

Given that a single core of your CPU has a peak throughput of 21.28 GFLOP/s, while your GPU has a usable throughput of 17.28 GFLOP/s, I’m not surprised.

Where did you look that information up?

Are you saying that my GPU actually has less ‘computing power’ than my CPU?

I just took the data you provided (together with my knowledge of the architectures):

A single core of a Core 2 Quad can perform two arithmetic operations on four operands each per cycle in its SSE(2) SIMD unit. At 2.66 GHz, this gives 2×4×2.66 GHz = 21.28 GFLOP/s.
A single SM of a compute capability 1.x device performs one multiply-add per cycle in each of its 8 ALUs (nowadays called “cores” by Nvidia), which run at twice the device clock rate of 540 MHz. Counting the multiply-add as two operations, and discounting the extra multiply that can sometimes be scheduled in the special function units (it cannot be exploited in matrix-vector multiplication, because that has a balanced multiply/add ratio), this gives 2×8×2×0.54 GHz = 17.28 GFLOP/s.

And yes, I am saying your GPU has less computational power than your CPU (which, on top of the above calculation, has four cores). You are comparing a high-end CPU to the very lowest available (and actually outdated) GPU.

To realize the potential advantages of GPU computing, you’ll have to invest in a high-end GPU. A GTX 460 would be a balanced choice (even though not top-notch): it achieves between 604.8 and 907.2 GFLOP/s peak at about 175 US$. Nvidia’s fastest single-GPU card, the GTX 580, has a peak throughput of 1,581 GFLOP/s.

It is also worth pointing out that the matrix-vector product is not a great candidate for GPU acceleration anyway. A basic real-valued implementation has a lower bound of N² + 2N memory transactions for 2N² FLOPs, which is not a recipe for hitting peak throughput.
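
To spell that out: with 4-byte floats, the arithmetic intensity is bounded by

```latex
\underbrace{2N^2}_{\text{FLOPs}} \Big/ \underbrace{4\,(N^2 + 2N)}_{\text{bytes moved}}
\;\longrightarrow\; \tfrac{1}{2}\ \text{FLOP/byte} \quad (N \to \infty)
```

i.e. about half a FLOP per byte moved, and unlike the matrix-matrix product (which performs O(N) FLOPs per element loaded) this cannot be improved by blocking. The kernel is therefore capped by memory bandwidth, not arithmetic throughput, no matter how well it is tuned.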

Yes, the matrix-vector product isn’t as “fine-grained” as the product of two matrices; but because I have made a decent matrix-matrix product implementation that improves performance by several factors, I wanted to see if I could accelerate the matrix-vector operation as well.

But it doesn’t look very feasible (at least with my hardware), and that’s good to know, so I can move on to solving other problems.

Thank you everyone.

It’s a good candidate for GPU acceleration: GPUs have far higher memory bandwidth than any CPU-based configuration (192 GB/s on the GTX 580). But if you’re going to be passing the data over PCIe just to do a matrix-vector product, it might not be very worthwhile.
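
As a rough worked example (assuming ~6 GB/s of effective PCIe 2.0 x16 bandwidth, and N = 8192 so that A occupies 256 MiB in single precision):

```latex
t_{\mathrm{PCIe}} \approx \frac{256\ \text{MiB}}{6\ \text{GB/s}} \approx 43\ \text{ms},
\qquad
t_{\mathrm{GEMV}} \approx \frac{256\ \text{MiB}}{192\ \text{GB/s}} \approx 1.3\ \text{ms}
```

The transfer alone costs roughly 30× more than the product itself, so the GPU only pays off if A stays resident on the device and is reused across many products.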