how to design grid and block

swhastan · May 16, 2009, 4:58pm

Hi, I’m a CUDA novice, and happy to have found this forum.

Now, I’m trying to parallelize my code with cuda.

My code has computationally intensive routines. In one of these routines there are three nested for-loops, which together iterate 5x5x4 times.

So, as a first step, I’d like to parallelize this code. Here is my question. Consider these two configurations:

1)
dim3 grid(1,1);
dim3 block(5,5,4);

2)
dim3 grid(4,1);
dim3 block(5,5,1);

Which one is more efficient?

Also, could you suggest a better way, if there is one?

Thank you.

gatoatigrado · May 16, 2009, 5:24pm

Yes, use many more threads. 5x5x4 = 100 threads. You should be using at least 5000. 100 calculations don’t seem intensive even for the CPU. If each routine is dependent on the next, the CPU is the way to go.

regards,
Nicholas
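
To make the "at least 5000 threads" advice concrete, here is a minimal host-side sketch of how the block count for a larger problem is usually derived with ceiling division. The helper name blocks_needed and the sample sizes are assumptions for illustration, not something from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest number of blocks such that
// blocks * threads_per_block >= n (ceiling division).
std::size_t blocks_needed(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With 5000 work items and 100 threads per block, blocks_needed(5000, 100) gives 50 blocks, so the launch would use something like dim3 grid(50) rather than dim3 grid(1,1). When n is not a multiple of the block size, the last block has spare threads, and the kernel needs a bounds check on its index.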

swhastan · May 16, 2009, 7:43pm

Thank you for your reply.

I’m just trying to convert part of my code. If it is successful, I’ll do more.

I’m just wondering which of these is more efficient.

1)
dim3 grid(1,1);
dim3 block(5,5,4);

2)
dim3 grid(4,1);
dim3 block(5,5,1);

If I use 100 threads in a block, I guess they will not all execute concurrently, because only one warp (32 threads) can execute concurrently on a multiprocessor.

So, I think that 2) is better than 1). Am I right?

Thank you.

kristleifur · May 17, 2009, 12:14pm

These two cases are too small to think about in terms of efficiency. Latency will kill it.

But yes, I’d guess that using 4 multiprocessors instead of 1 is better, so #2 is more efficient … but it still looks like a false optimisation to me. I mean, with any reasonable size of problem, you’ll be using all multiprocessors anyway. Then register usage, latency hiding and memory access patterns will dominate, and it depends on the algorithm whether you get better results with 5x5x4 blocks or 5x5x1 blocks. I’d guess that in a real-world case, with enough data to properly get the multiprocessors going, you’ll see more speed with 5x5x4 blocks because you’ll hide more global memory latency.
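
For reference, both launch configurations address the same 100 loop iterations and differ only in which built-in index carries the third loop variable. The sketch below emulates that mapping in plain host-side C++. The function names are hypothetical, and the tx, ty, tz and bx parameters stand in for threadIdx.x, threadIdx.y, threadIdx.z and blockIdx.x.

```cpp
#include <cassert>

// Loop extents from the question: i in [0,5), j in [0,5), k in [0,4).
constexpr int NI = 5, NJ = 5, NK = 4;

// grid(1,1), block(5,5,4): threadIdx.x/y/z play the role of i, j, k.
int flat_index_one_block(int tx, int ty, int tz) {
    return tx + NI * (ty + NJ * tz);
}

// grid(4,1), block(5,5,1): blockIdx.x plays the role of k.
int flat_index_four_blocks(int bx, int tx, int ty) {
    return tx + NI * (ty + NJ * bx);
}

// Returns true if each configuration visits every element exactly once.
bool both_cover_all_elements() {
    int hits1[NI * NJ * NK] = {0};
    int hits2[NI * NJ * NK] = {0};
    for (int k = 0; k < NK; ++k)
        for (int j = 0; j < NJ; ++j)
            for (int i = 0; i < NI; ++i) {
                ++hits1[flat_index_one_block(i, j, k)];
                ++hits2[flat_index_four_blocks(k, i, j)];
            }
    for (int n = 0; n < NI * NJ * NK; ++n)
        if (hits1[n] != 1 || hits2[n] != 1) return false;
    return true;
}
```

Since the work assignment is identical, any speed difference between the two comes from how the hardware schedules the blocks, not from the indexing.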

gatoatigrado · May 18, 2009, 7:13am

Yes, only 32 threads are concurrently executed (actually 8 calculating at once, but reads/writes are synchronized for the entire warp iirc). If these are indeed mapped to separate multiprocessors, it will be more efficient (I’m not sure if the first multiprocessor will just grab more blocks though).

regards,
Nicholas
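
The warp arithmetic behind this can be checked with a small host-side sketch. It assumes the warp size of 32 mentioned above; the helper names are made up for illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // warp size assumed from the discussion above

// Number of warps a block occupies: threads are packed into warps of 32,
// and a partially filled warp still counts as a whole warp.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes in the block's last warp that have no thread assigned to them.
int idle_lanes_per_block(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}
```

A block of 5x5x4 = 100 threads occupies 4 warps with 28 idle lanes. A block of 5x5x1 = 25 threads occupies 1 warp with 7 idle lanes, so four such blocks also add up to 4 warps and 28 idle lanes. By this count the two options differ only in whether those four warps sit in one block or are spread over four.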
