Designing a CUDA algo question (sort of a newbie question)

Hi all.

So I’ve read through the CUDA programming guide, and several other online sources, but there are a couple details that I think these documents glossed over (or I just didn’t make the connection I was supposed to) that I’m trying to clear up. I’ve been programming for a while, but just recently started with CUDA, and it was suggested that I try a sample equation to best illustrate the parallel nature of algorithms using CUDA.

Let’s say we have three NxN arrays, A, B and C. Large arrays (like N = 500+)

A(i,j) = square subarray of A where i is the min index and j is the max index

C(i,j) = A(i,j) + A(i,j)*B(i,j)

i and j are arbitrary, but in general the smaller the subarray the better, as long as everything doesn’t become excessively slow.

So basically, taking submatrices and multiplying/adding them.

The algorithm itself isn’t that complicated to code, but in order to maximize the parallel nature of CUDA, I’m debating how to implement it.

I could copy the entire array to the device immediately, construct as many threads as necessary, and give each of them a piece of the input.
I could assign each block a subarray and just run the kernel on each subarray.
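
The two options above can be combined: copy the full arrays once, then launch one block per subarray. What follows is only a rough sketch under several assumptions that are not in the post: the arrays are row-major floats, N is a multiple of the subarray edge (N = 500 would need padding to 512 or a bounds check), and "*" is read as a matrix product of the two subarrays. The names TILE, subarrayKernel and runSubarrays are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // hypothetical subarray edge: 16 x 16 = 256 threads per block

// One block per TILE x TILE subarray, one thread per entry.
// Computes C_sub = A_sub + A_sub * B_sub, reading "*" as a matrix product
// of the two subarrays (an assumption; the post does not say which product).
__global__ void subarrayKernel(const float* A, const float* B, float* C, int n)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    // Stage this block's subarrays of A and B in shared memory.
    sA[threadIdx.y][threadIdx.x] = A[row * n + col];
    sB[threadIdx.y][threadIdx.x] = B[row * n + col];
    __syncthreads();

    // Start from the A entry, then add the row-by-column product.
    float acc = sA[threadIdx.y][threadIdx.x];
    for (int k = 0; k < TILE; ++k)
        acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];

    C[row * n + col] = acc;
}

// Host wrapper: copies the full arrays to the device once, launches one
// block per subarray, and copies the result back. Assumes n % TILE == 0.
std::vector<float> runSubarrays(const std::vector<float>& A,
                                const std::vector<float>& B, int n)
{
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);          // one thread per subarray entry
    dim3 grid(n / TILE, n / TILE);   // one block per subarray
    subarrayKernel<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(size_t(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

If the product is meant to be elementwise instead, the shared arrays and the loop over k are unnecessary and each thread can compute A[idx] + A[idx] * B[idx] directly.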

How should I deal with block/thread division to maximize the parallelism? For example I have 14 multiprocessors and 112 cores, so I’m assuming that means that 112 threads can be executed in parallel. But I’ve also read that in order to keep the GPU busy you should have 20-24 times the number of cores. So that would be 2688 threads. But since it’s only one block per multiprocessor, how big of a difference would it be if I used 14 blocks vs. more? It also seems like a good way to solve this problem would be one block per submatrix, with 1 thread per entry, but how can I figure out how many threads I can have per block?
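
On the last point, the per-block thread limit does not have to be guessed: the runtime reports it through cudaGetDeviceProperties. A small sketch, assuming device 0 is the card in question (printLaunchLimits is a made-up helper name):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints the limits that bound a launch configuration on device 0.
// Returns the maximum threads per block, or -1 if the query fails.
int printLaunchLimits()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;

    printf("multiprocessors:            %d\n", prop.multiProcessorCount);
    printf("warp size:                  %d\n", prop.warpSize);
    printf("max threads per block:      %d\n", prop.maxThreadsPerBlock);
    printf("max threads per multiproc.: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max block dimensions:       %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
    return prop.maxThreadsPerBlock;
}
```

For scale: 112 cores x 24 = 2688 threads is only about 11 blocks of 256 threads, while N = 512 split into 16 x 16 subarrays already gives 32 x 32 = 1024 blocks. A multiprocessor can also hold more than one resident block at a time when registers and shared memory allow, so launching many more blocks than multiprocessors is normal.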

I apologize if this thread is scattered; as I said I’m just starting with CUDA so there are a lot of little things I’m still trying to get a firm handle on.

Can’t you just put your kernel launch in a loop that varies the launch parameters each time around and measure the runtime of each? That will give you empirical evidence which I would trust more than anything else.
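
A minimal sketch of such a loop, timed with CUDA events. The kernel here is a hypothetical elementwise stand-in (c = a + a*b), not the real subarray kernel, and the sweep range of 32 to 512 threads per block is an assumption chosen to stay within the limit of older cards.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in kernel: c[i] = a[i] + a[i] * b[i], elementwise.
__global__ void axpbKernel(const float* a, const float* b, float* c, int count)
{
    // Grid-stride loop, so any grid/block size covers every element.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + a[i] * b[i];
}

// Times the kernel for block sizes 32, 64, ..., 512 and returns the
// block size with the shortest measured runtime.
int sweepBlockSizes(int count)
{
    size_t bytes = size_t(count) * sizeof(float);
    float *a = 0, *b = 0, *c = 0;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best = 0;
    float bestMs = 1e30f;
    for (int threads = 32; threads <= 512; threads *= 2) {
        int blocks = (count + threads - 1) / threads;

        // Warm-up launch so one-time startup cost is not measured.
        axpbKernel<<<blocks, threads>>>(a, b, c, count);

        cudaEventRecord(start);
        axpbKernel<<<blocks, threads>>>(a, b, c, count);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block, %5d blocks: %.3f ms\n", threads, blocks, ms);
        if (ms < bestMs) { bestMs = ms; best = threads; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return best;
}
```

Calling sweepBlockSizes(512 * 512) would print one line per configuration. A single timed launch is noisy, so averaging several runs per configuration is probably worth the extra lines.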

Check the programming guide and its matrix multiplication example.