CUDA software and hardware mapping


     I am a beginner at CUDA development. I have some questions on CUDA software to hardware mapping. These are as follows:
  1. What is the maximum number of blocks that a program can allocate?
  2. Can all the blocks allocate the maximum number of threads (512) in a program?
  3. How do threads in different blocks communicate? (Global memory?)
  4. What is the maximum number of threads that a processor can execute?
  5. Does a multiprocessor execute a whole grid of blocks or a single block?


Correct me if I’m wrong, I’m still learning:

  1. A program can allocate up to 65536 blocks

  2. Yes, but if a block contains too many threads, then it may require too many resources (registers, shared memory) to fit on a single multiprocessor and cannot execute.

  3. Avoid communicating between blocks. They should be able to run in any order.

  4. A processor can execute one warp (32 threads) simultaneously, and time-slices between warps if there are more than 32 threads allocated to it (in the same block or multiple blocks)

  5. A multiprocessor executes one or more blocks, but generally not an entire grid. As many as will fit, up to the available resources, or a maximum of 8, whichever is less.

Small correction: 65535*65535 blocks can be run. (the limit is 65535 in each of 2 dimensions).

Great feedback on this thread! Can I stretch it a bit further please?

I am another beginner who is trying to find the maximum data set a Compute 1.3 card can handle. Increasing the data set until I get a kernel failure is what I am currently doing using 1D Grids and topping out at about 30 million data points. I’d prefer to be more scientific about it and also move to 2D Grids.

As I understand it, a Compute 1.3 card has 30 multiprocessors with up to 1024 threads each. I also understand that a high block dimension leads to more efficient processing. The block size is limited to 512 threads, e.g. dimBlock(512, 1, 1);

The barrier I am running into is in setting the dimGrid parameter. Currently I am using:
dim3 dimGrid(DataPoints/dimBlock.x,1, 1);

Consequently DataPoints is limited to 65535*512 = 33,553,920, or roughly 5792 x 5792 (really 4096 x 4096, as Datax is a power of 2).
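
The arithmetic behind that 1D limit can be checked with a few lines of host-side C++. This is only a sketch: the constants are the Compute 1.3 figures quoted in this thread, and the helper names (maxPoints1D, maxSquareSide, floorPow2) are made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Limits quoted in the thread for a Compute 1.3 card (assumed, not queried).
constexpr std::int64_t kMaxGridDimX = 65535;       // blocks per grid dimension
constexpr std::int64_t kMaxThreadsPerBlock = 512;  // threads per block

// Largest number of data points a 1D grid can cover with one thread per point.
constexpr std::int64_t maxPoints1D() {
    return kMaxGridDimX * kMaxThreadsPerBlock;
}

// Largest square side whose area still fits in `points`.
inline std::int64_t maxSquareSide(std::int64_t points) {
    auto side = static_cast<std::int64_t>(std::sqrt(static_cast<double>(points)));
    // Correct for any floating-point rounding in either direction.
    while (side * side > points) --side;
    while ((side + 1) * (side + 1) <= points) ++side;
    return side;
}

// Largest power of two not exceeding n (n >= 1).
inline std::int64_t floorPow2(std::int64_t n) {
    std::int64_t p = 1;
    while (p * 2 <= n) p *= 2;
    return p;
}
```

With these limits, maxPoints1D() is 33,553,920, the largest square side is 5792, and the largest power-of-two side is 4096, which matches the failure at a little over 30 million points.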

DataPoints = Datax*Datax;

This is what I currently use
A: dim3 dimBlock(512, 1, 1);
A: dim3 dimGrid(DataPoints/dimBlock.x, 1, 1);

This is OK, but reduces the number of threads per block to 256
B: dim3 dimBlock(16, 16, 1);
B: dim3 dimGrid(Datax/dimBlock.x, Datax/dimBlock.y, 1);

This doesn’t work.
C: dim3 dimBlock(16, 16, 2);
C: dim3 dimGrid(Datax/dimBlock.x/dimBlock.z, Datax/dimBlock.y/dimBlock.z, 1);

How do I use a full 512 threads per block but expand the data set to 2D?


[Edit to correct dimBlock mistake]

I ran up against the 65535 limit myself and I handled it something like this (paraphrasing):


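int totalblocks = DataPoints / threadsperblock;
int blocksx = totalblocks;
int blocksy = 1;

// Halve x and double y until x fits under the 65535 limit.
// Note: this only works while blocksx stays even.
while (blocksx > 65535 && (blocksx % 2 == 0)) {
    blocksx /= 2;
    blocksy *= 2;
}

dim3 dimGrid(blocksx, blocksy);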
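
The halving loop above stops early if the block count becomes odd while still over 65535, and the integer division drops any partial last block. One possible alternative, sketched here in host-side C++, is to round up with ceiling division and let threads past the end of the data do nothing. This is not the poster's method: Grid2D is a stand-in for dim3, and splitGrid, linearBlock and inRange are hypothetical helper names. It assumes 1 <= totalBlocks <= 65535*65535.

```cpp
#include <cstdint>

// Stand-in for CUDA's dim3 so the sketch compiles without the CUDA toolkit.
struct Grid2D {
    std::int64_t x;
    std::int64_t y;
};

constexpr std::int64_t kMaxGridDim = 65535;  // per-dimension limit from the thread

// Ceiling division for positive integers.
inline std::int64_t ceilDiv(std::int64_t a, std::int64_t b) {
    return (a + b - 1) / b;
}

// Pick a 2D grid holding at least totalBlocks blocks, with x <= kMaxGridDim.
// The grid may contain a few surplus blocks when totalBlocks is odd.
inline Grid2D splitGrid(std::int64_t totalBlocks) {
    Grid2D g;
    g.y = ceilDiv(totalBlocks, kMaxGridDim);
    g.x = ceilDiv(totalBlocks, g.y);
    return g;
}

// Linear block index, as a kernel would compute it:
// blockIdx.y * gridDim.x + blockIdx.x
inline std::int64_t linearBlock(std::int64_t bx, std::int64_t by, const Grid2D& g) {
    return by * g.x + bx;
}

// Bounds check a kernel would perform before touching the data:
// threads in the surplus blocks (or a partial last block) return early.
inline bool inRange(std::int64_t block, std::int64_t thread,
                    std::int64_t threadsPerBlock, std::int64_t dataPoints) {
    return block * threadsPerBlock + thread < dataPoints;
}
```

For example, 40,000,000 points at 512 threads per block need 78,125 blocks, an odd number that the halving loop cannot split. splitGrid(78125) gives a 39063 x 2 grid with one surplus block, which the bounds check disables.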


If you want 512 threads per block and more than 65535 blocks you could do something similar to your case B:


dim3 dimBlock(16, 32, 1);

dim3 dimGrid(Datax/dimBlock.x, Datax/dimBlock.y, 1);
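
With a 2D launch like this, each thread has to turn its block and thread coordinates into an index into the Datax x Datax array. The following host-side C++ sketch mirrors the index arithmetic a kernel would typically use. Idx2 stands in for the CUDA built-ins (x and y only), elementIndex and blocksNeeded are hypothetical names, and row-major storage is an assumption.

```cpp
#include <cstdint>

// Stand-in for CUDA's blockIdx / threadIdx / blockDim (x and y only).
struct Idx2 {
    int x;
    int y;
};

// Row-major element index for one thread of a 2D launch over a
// datax-by-datax array. Returns -1 when the thread lies outside the data,
// which happens in edge blocks if datax is not a multiple of the block size.
inline std::int64_t elementIndex(Idx2 blockIdx, Idx2 threadIdx, Idx2 blockDim, int datax) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= datax || row >= datax) return -1;
    return static_cast<std::int64_t>(row) * datax + col;
}

// Blocks needed along one dimension, rounded up so no element is missed.
inline int blocksNeeded(int n, int blockDim) {
    return (n + blockDim - 1) / blockDim;
}
```

For Datax = 8192 and a 16 x 32 block this gives a 512 x 256 grid, well inside the 65535 limit in each dimension. Rounding up in blocksNeeded matters only when Datax is not a multiple of the block dimensions.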


For me if the problem is not intrinsically 2-D, I try to keep the y and z dimensions equal to 1, and only expand to 2D to overcome the 65535 limit.

As for the related question of whether 512 threads per block is the best number, I’m not sure. In general I don’t think it’s usually optimal to maximize the threads per block, because the threads-per-multiprocessor cap means fewer blocks can be active, so when they stall on memory or synchronization there is less chance to work on something else. With a maximum of 1024 threads per multiprocessor and 512 threads per block, at most two blocks can be active, and if both are stalled the processor sits idle. With 128 threads per block, 8 blocks could be active (other resources allowing) and are less likely to all be stalled at once.
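
The block-count reasoning above can be written down as a small host-side C++ sketch. It uses only the two Compute 1.3 limits mentioned in this thread (1024 threads and 8 blocks per multiprocessor) and deliberately ignores register use, shared memory and warp rounding, so it is an upper bound, not a real occupancy calculation. The function names are hypothetical.

```cpp
#include <algorithm>

// Compute 1.3 limits quoted in the thread (assumed, not queried from the device).
constexpr int kMaxThreadsPerMP = 1024;
constexpr int kMaxBlocksPerMP = 8;

// Upper bound on blocks resident on one multiprocessor, considering only
// the thread cap and the block cap.
inline int maxActiveBlocks(int threadsPerBlock) {
    return std::min(kMaxBlocksPerMP, kMaxThreadsPerMP / threadsPerBlock);
}

// Resident threads as a whole-number percentage of the per-multiprocessor cap.
inline int residentThreadPercent(int threadsPerBlock) {
    return 100 * maxActiveBlocks(threadsPerBlock) * threadsPerBlock / kMaxThreadsPerMP;
}
```

Under these assumptions, 512 threads per block allows 2 resident blocks and 128 allows 8, both filling the 1024-thread cap. At 64 threads per block the 8-block limit takes over and only half the thread slots are used.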

Like I said, I’m still learning, so I’d be interested to hear other opinions or experiences.

Hi Jamie, thanks for the insight. It’s helped a good deal. In case you are interested here is the GPU occupancy with differing Block configurations with a large data set size.

//const dim3 dimBlock(8, 8, 1); // 64 87.95%
//const dim3 dimBlock(16, 8, 1); // 128 88.48%
//const dim3 dimBlock(16, 16, 1); // 256 88.33%
//const dim3 dimBlock(22, 22, 1); // 484 87.75%
//const dim3 dimBlock(32, 16, 1); // 512 87.53%