Computing a gazillion of direct products: can it be done efficiently on CUDA?

Hello all,

My application will need to compute direct products of a single vector (~700 float values) with 100,000 other vectors (same length, obviously), and do it very fast.

Can this be done efficiently on CUDA and, say, the GT200 cards? Or will this be constrained by memory accesses? I read that the computation needs to have a high “compute/memory access” ratio to work well on CUDA.

What other pitfalls are there to look for in a computation like that?

Any advice will be appreciated!



Your app will certainly be memory bound, but most are. A high compute/memory access ratio is only needed if you want to get a lot of GFLOP/s. There is nothing slow about the 141 GiB/s of memory bandwidth available on G200, so your app will still run extremely fast compared to a CPU implementation. You can estimate the performance you will get by counting the number of bytes read from and written to memory and assuming a memory throughput of ~110 GiB/s (about 80% of the peak is achievable).
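As a rough illustration of that estimate, here is a minimal sketch in plain Python. The function name and its parameters are made up for this example; the only figures taken from the thread are the problem size (100,000 vectors of ~700 floats) and the ~110 GiB/s throughput assumption.

```python
GIB = 2**30  # bytes in one GiB

def estimate_pass_ms(num_vectors, dim, bytes_per_value=4, throughput_gib_s=110.0):
    """Estimated time in ms to stream all stored vectors once.

    Hypothetical helper: it only counts bytes moved and divides by an
    assumed sustained throughput, ignoring launch overhead and caching.
    """
    bytes_read = num_vectors * dim * bytes_per_value   # the big array dominates
    bytes_read += dim * bytes_per_value                # the single query vector
    bytes_written = num_vectors * bytes_per_value      # one result per stored vector
    total_bytes = bytes_read + bytes_written
    return 1000.0 * total_bytes / (throughput_gib_s * GIB)

print(f"{estimate_pass_ms(100_000, 700):.2f} ms per query")
```

With these numbers the estimate comes out at roughly 2.4 ms per query vector, which is only a ballpark until measured on real hardware.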

Make absolutely sure you coalesce memory reads and writes.
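What coalescing implies for the data layout depends on how threads are mapped to the work, which nobody has specified here. A minimal sketch, assuming one thread per stored vector (my assumption) and using hypothetical index helpers: neighboring threads should touch neighboring addresses, so in that mapping the array wants to be stored dimension-major rather than vector-major.

```python
def index_vector_major(i, k, num_vectors, dim):
    """Element k of vector i when each vector is stored contiguously."""
    return i * dim + k

def index_dim_major(i, k, num_vectors, dim):
    """Element k of vector i when element k of all vectors is stored contiguously."""
    return k * num_vectors + i

def stride_between_threads(index_fn, num_vectors, dim, k=0):
    """Address distance (in elements) between thread 0 and thread 1,
    assuming thread i works on stored vector i and both read element k."""
    return index_fn(1, k, num_vectors, dim) - index_fn(0, k, num_vectors, dim)

N, D = 100_000, 700
print(stride_between_threads(index_vector_major, N, D))  # stride of D elements
print(stride_between_threads(index_dim_major, N, D))     # stride of 1 element
```

A stride of 1 is the pattern the hardware can coalesce. If you instead assign a group of threads to each vector and let them walk along it together, the vector-major layout is the one that gives unit stride.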

So this is 100,000 dot products of 700 dimensional vectors?
You’re right that this is not a compute-limited application, since the math is so simple.
Basically you’re sending 70,000,000 floats over the PCIe bus just so they can be used for a single multiply each. Ouch, you’re not going to win anything with that.

Now that may be fine if you have to do this a lot with MANY different weight vectors, since the transfer time can then be amortized. You’ll still be limited by GPU memory bandwidth, but that’s still a LOT (10x) faster than the CPU, which would also be memory bandwidth limited.

I’d disagree. The 70 million floats of 4 bytes each are just ~270 MB, so they are transferred to the GPU in pretty much no time.

Anyways, not sure, but I’d give it a shot since it sounds like an easy problem to implement.

No - I will need the direct products calculated multiple times, and the 100,000 vectors are always the same: only the single vector that I multiply them with is changed.

So I plan to have only one transfer of the 100,000 vectors into the device memory, and then just provide the single “query vector” every time I need the direct products.
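To put rough numbers on that plan, here is a small sketch of how the one-time upload gets amortized. Everything in it is hypothetical: the function names are made up, the 5 GB/s PCIe figure is only a placeholder to be replaced with a measured value, and the per-query cost is just the ~110 GiB/s estimate mentioned earlier in the thread.

```python
def upload_ms(num_vectors, dim, pcie_gb_s, bytes_per_value=4):
    """One-time host-to-device copy of the stored vectors, in ms."""
    return 1000.0 * num_vectors * dim * bytes_per_value / (pcie_gb_s * 1e9)

def amortized_ms_per_query(upload, per_query, num_queries):
    """Average cost per query when the upload is paid only once."""
    return upload / num_queries + per_query

# Placeholder bandwidth: measure the real PCIe throughput on your own system.
upload = upload_ms(100_000, 700, pcie_gb_s=5.0)

# Per-query cost: bytes of the big array divided by ~110 GiB/s.
per_query = 1000.0 * 100_000 * 700 * 4 / (110.0 * 2**30)

for q in (1, 10, 100, 1000):
    print(q, round(amortized_ms_per_query(upload, per_query, q), 2), "ms per query")
```

Under these assumptions the upload costs about as much as a couple of dozen queries, so it stops mattering once the same stored vectors are reused a few hundred times.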

Then you’ll do very very well on the GPU, mostly from the huge GPU on-device bandwidth.

If you can provide more than one query vector at once, you can compute several in parallel, sharing the memory bandwidth used to stream your big array. That’d get you a further speedup by some multiple. But even with a single query vector you should see great speed.
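The idea of sharing one pass over the big array can be sketched on the CPU in plain Python. This is only a reference for the arithmetic, not GPU code, and the function names are made up for illustration: each stored vector is fetched once and reused for every query in the batch.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def dots_one_query(stored, query):
    """One result per stored vector for a single query vector."""
    return [dot(row, query) for row in stored]

def dots_many_queries(stored, queries):
    """results[i][j] = dot(stored[i], queries[j]).

    Each stored vector is read once in the outer loop and then reused
    for all queries, which is where the bandwidth saving comes from.
    """
    results = []
    for row in stored:
        results.append([dot(row, q) for q in queries])
    return results

stored = [[1, 2, 3], [4, 5, 6]]      # toy stand-in for the 100,000 vectors
queries = [[1, 0, 0], [1, 1, 1]]     # two query vectors in one batch
print(dots_many_queries(stored, queries))
```

Column j of the result is the same as running the single-query version with query j, so batching changes only how often the big array has to be streamed.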

Yeah I see what you mean… So here’s another question following from that:

When you use several query vectors, you are in effect doing a matrix multiply: a large matrix, which is always the same, times a smaller matrix, which changes every time.

CUDA BLAS should do matrix multiply very well, but is there a way to avoid the PCIe transfer of the large matrix every time the call is made with CUDA BLAS?

I hope I am not asking something that’s too specialized.

You could write your own matrix multiplication algorithm. Matrix multiplication must be the most common example of CUDA code in all tutorials, so you should be able to find a good reference.

CUBLAS expects your matrices to be on the device already anyway. So you can just copy your matrix to the device using either cublasSetMatrix or a plain cudaMemcpy, then call CUBLAS operations several times and update only the small matrix in between. There is no implicit data transfer done by CUDA.