CUDA software and hardware mapping

Rahman · February 16, 2009, 4:15pm

Hello

I am a beginner at CUDA development. I have some questions about how CUDA software maps to hardware:

What is the maximum number of blocks that a program can allocate?
Can all the blocks allocate the maximum number of threads (512) in a program?
How do threads in different blocks communicate? (Through global memory?)
What is the maximum number of threads that a processor can execute?
Does a multiprocessor execute a grid of blocks or a single block?

Thanks
Rahman

Jamie_K · February 16, 2009, 4:47pm

Correct me if I’m wrong, I’m still learning:

A program can allocate up to 65536 blocks
Yes, but if a block contains too many threads, then it may require too many resources (registers, shared memory) to fit on a single multiprocessor and cannot execute.
Avoid communicating between blocks. They should be able to run in any order.
A processor can execute one warp (32 threads) simultaneously, and time-slices between warps if there are more than 32 threads allocated to it (in the same block or multiple blocks)
A multiprocessor executes one or more blocks, but generally not an entire grid. As many as will fit, up to the available resources, or a maximum of 8, whichever is less.

MisterAnderson42 · February 16, 2009, 5:33pm

Small correction: 65535*65535 blocks can be run (the limit is 65535 in each of 2 dimensions).
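As a quick sanity check on that limit, here is a minimal host-side sketch in plain C++ that tests whether a requested grid fits. The 65535 constant is simply the per-dimension figure quoted in this thread, and the names `kMaxGridDim`, `gridFits`, and `blocksInGrid` are made up for illustration; they are not part of the CUDA API.

```cpp
#include <cassert>

// Per-dimension grid limit quoted in this thread (assumed here, not queried
// from a real device).
const long long kMaxGridDim = 65535;

// Returns true if a grid of gx by gy blocks respects the per-dimension limit.
bool gridFits(long long gx, long long gy) {
    return gx >= 1 && gy >= 1 && gx <= kMaxGridDim && gy <= kMaxGridDim;
}

// Total number of blocks in a 2D grid that fits the limit.
long long blocksInGrid(long long gx, long long gy) {
    assert(gridFits(gx, gy));
    return gx * gy;
}
```

On a real device the limits are probably better read from the `maxGridSize` field that `cudaGetDeviceProperties` fills in than hard-coded like this.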

JohnW · February 20, 2009, 5:50am

Great feedback on this thread! Can I stretch it a bit further please?

I am another beginner who is trying to find the maximum data set a Compute 1.3 card can handle. Increasing the data set until I get a kernel failure is what I am currently doing using 1D Grids and topping out at about 30 million data points. I’d prefer to be more scientific about it and also move to 2D Grids.

As I understand it, a Compute 1.3 card has 1024 threads per multiprocessor and 30 multiprocessors. I also understand that a larger block size leads to more efficient processing. The maximum block size is 512 threads, e.g. dimBlock(512, 1, 1);

The barrier I am running into is in setting the dimGrid parameter. Currently I am using:
dim3 dimGrid(DataPoints/dimBlock.x,1, 1);

Consequently DataPoints are limited to 65535*512, or less than 5792 x 5792 (really 4096 as it is a power of 2).

DataPoints = Datax*Datax;
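The arithmetic behind those figures can be reproduced with a small host-side sketch. It assumes one thread per data point and a square data set; the function names are hypothetical and only serve to show the calculation.

```cpp
#include <cassert>

// Largest data set a 1D grid can cover, with one thread per data point.
long long maxPoints1D(long long maxBlocks, long long threadsPerBlock) {
    assert(maxBlocks > 0 && threadsPerBlock > 0);
    return maxBlocks * threadsPerBlock;
}

// Largest side s such that s * s <= points (simple linear search).
long long largestSquareSide(long long points) {
    assert(points >= 0);
    long long s = 0;
    while ((s + 1) * (s + 1) <= points) ++s;
    return s;
}

// Largest power-of-two side s such that s * s <= points.
long long largestPow2Side(long long points) {
    assert(points >= 1);
    long long s = 1;
    while ((2 * s) * (2 * s) <= points) s *= 2;
    return s;
}
```

With 65535 blocks of 512 threads this gives 33553920 points, a largest square side of 5792, and a largest power-of-two side of 4096, which matches the numbers above.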

This is what I currently use
A: dim3 dimBlock(512, 1, 1);
A: dim3 dimGrid(DataPoints/dimBlock.x, 1, 1);

This is OK, but reduces the number of Blocks
B: dim3 dimBlock(16, 16, 1);
B: dim3 dimGrid(Datax/dimBlock.x, Datax/dimBlock.y, 1);

This doesn’t work.
C: dim3 dimBlock(16, 16, 2);
C: dim3 dimGrid(Datax/dimBlock.x/dimBlock.z, Datax/dimBlock.y/dimBlock.z, 1);

How do I use the full 512 threads per block but expand the data set to 2D?

Thanks,
John

[Edit to correct dimBlock mistake]

Jamie_K · February 20, 2009, 11:15pm

I ran up against the 65535 limit myself and I handled it something like this (paraphrasing):

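[codebox]
// Number of blocks needed (assumes DataPoints is a multiple of threadsperblock).
int totalblocks = DataPoints / threadsperblock;
int blocksx = totalblocks;
int blocksy = 1;

// Fold the 1D block count into 2D until x fits the 65535 per-dimension limit.
// The loop only halves while blocksx is even, so it can exit with blocksx
// still above 65535 if blocksx becomes odd first.
while (blocksx > 65535 && (blocksx % 2 == 0)) {
    blocksx /= 2;
    blocksy *= 2;
}

dim3 dimGrid(blocksx, blocksy);
[/codebox]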
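Once the blocks are folded into a 2D grid like this, each thread has to rebuild its linear element index. The sketch below shows that arithmetic as plain C++ so it can be checked on the host; it assumes 1D blocks and row-major block numbering, and the names `flatIndex` and `inRange` are hypothetical. In a kernel the arguments would come from `blockIdx.x`, `blockIdx.y`, `gridDim.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cassert>

// Linear element index for a 1D block inside a 2D grid.
// bx, by: block coordinates; gridX: grid width in blocks;
// threadsPerBlock: block size; tx: thread index within the block.
long long flatIndex(int bx, int by, int gridX, int threadsPerBlock, int tx) {
    assert(bx >= 0 && bx < gridX && by >= 0);
    assert(tx >= 0 && tx < threadsPerBlock);
    long long block = (long long)by * gridX + bx;  // row-major block number
    return block * threadsPerBlock + tx;
}

// Guard for grids that cover more threads than there are data points:
// only threads whose index is in range should touch the data.
bool inRange(long long index, long long dataPoints) {
    return index >= 0 && index < dataPoints;
}
```

The widening to `long long` is a cautious choice for the host-side check; whether a 32-bit index is enough depends on the data set size.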

If you want 512 threads per block and more than 65535 blocks you could do something similar to your case B:

[codebox]
dim3 dimBlock(16, 32, 1);
dim3 dimGrid(Datax/dimBlock.x, Datax/dimBlock.y, 1);
[/codebox]
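The division above is exact only when Datax is a multiple of both block dimensions. A minimal sketch of the usual round-up variant follows, written as plain C++ so it runs without a GPU; `Grid2D`, `ceilDiv`, and `makeGrid2D` are illustrative names, not CUDA types or functions.

```cpp
#include <cassert>

// Stand-in for the x and y fields of a dim3 grid.
struct Grid2D {
    int x;
    int y;
};

// Integer division rounded up, so a partial last block is still counted.
int ceilDiv(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Grid needed to cover a datax by datax data set with blockX by blockY blocks.
Grid2D makeGrid2D(int datax, int blockX, int blockY) {
    Grid2D g;
    g.x = ceilDiv(datax, blockX);
    g.y = ceilDiv(datax, blockY);
    return g;
}
```

When the grid is rounded up this way, threads that fall outside the data set need a bounds check in the kernel before they read or write anything.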

For me if the problem is not intrinsically 2-D, I try to keep the y and z dimensions equal to 1, and only expand to 2D to overcome the 65535 limit.

As for the related question of whether 512 threads per block is the best number, I’m not sure. In general terms I don’t think it’s usually optimal to maximize the threads per block, because the threads-per-multiprocessor cap means fewer blocks can be active, so when they stall on memory or synchronization there is less chance to work on something else. With a max of 1024 threads on a multiprocessor and 512 threads per block, at most two blocks can be active, and if both are stalled then the processor sits idle. With 128 threads per block, 8 blocks could be active (other resources allowing) and might be less likely to all be stalled.
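That reasoning can be put into a rough host-side estimate. The sketch below uses only the two caps mentioned in this thread (1024 threads and 8 blocks per multiprocessor) and deliberately ignores register and shared memory limits, which can lower the real number; `residentBlocksPerMP` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on blocks that can be active on one multiprocessor at once.
// Defaults are the figures quoted in this thread; registers and shared
// memory are not modelled here.
int residentBlocksPerMP(int threadsPerBlock,
                        int maxThreadsPerMP = 1024,
                        int maxBlocksPerMP = 8) {
    assert(threadsPerBlock > 0);
    return std::min(maxBlocksPerMP, maxThreadsPerMP / threadsPerBlock);
}
```

Under these assumptions 512 threads per block gives 2 active blocks and 128 gives 8, while going below 128 does not help further because the 8-block cap takes over.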

Like I said I’m still learning, I’d be interested to hear other opinions or experiences.

JohnW · February 21, 2009, 12:36am

Hi Jamie, thanks for the insight. It’s helped a good deal. In case you are interested here is the GPU occupancy with differing Block configurations with a large data set size.

// block configuration              // threads per block, occupancy
//const dim3 dimBlock(8, 8, 1);     // 64   87.95%
//const dim3 dimBlock(16, 8, 1);    // 128  88.48%
//const dim3 dimBlock(16, 16, 1);   // 256  88.33%
//const dim3 dimBlock(22, 22, 1);   // 484  87.75%
//const dim3 dimBlock(32, 16, 1);   // 512  87.53%
