Limit to Number of Blocks? Noob Question

I’ve got a question about the limits on the number of thread blocks.

When you launch a __global__ function, you need to specify the number of blocks and threads you’ll be using. I’m designing my program so that it can run an unknown number of simulations, with each simulation carried out by one thread. Since there is a limited amount of memory, I need to break the work into manageable chunks. After some trial and error, I arrived at my configuration from the occupancy calculator spreadsheet.

I can see that the spreadsheet makes no mention of the number of blocks when you enter your resource usage. Why doesn’t it take that into account? Does the scheduler on the GPU somehow handle kernel invocations with many blocks? Say 64 blocks fit on the GPU at once and each block contains 384 threads (24,576 simulations in total); if I launch five times that many blocks, will the GPU run them as 5 sets of 64? In other words, if the number of blocks exceeds the available resources, does the hardware manage this for me?

What I’ve been doing is choosing a number of threads based on the occupancy calculator and plugging that into my program. Then I calculate the number of blocks that will be required based on the number of simulations I need to run. Here’s the relevant code:

    /* Based on the CUDA occupancy calculator,
       which reports 50% occupancy for this configuration */
    const int numThreads = 384;
    int numSimulations;
    int numBlocks;

    readFile(fp, &numSimulations);

    /* Number of blocks is proportional to the number of simulations;
       round up so the last partial block is covered */
    numBlocks = (int) ceil((float) numSimulations / (float) numThreads);

    /* Device grid and block size */
    dim3 grid(numBlocks);
    dim3 threads(numThreads);

    /* Dynamic shared memory size in bytes */
    size_t sharedMemorySize = 2 * sizeof(float) * nr;

    photonSim<<<grid, threads, sharedMemorySize>>>(d_Rr, d_Tr, d_randNumArray);

I guess I just need a sanity check: does this work? Can you have an effectively unlimited number of blocks, but only a limited number of threads per block?

If not, should I have set up a loop with something like this?

    for (numBlocks = 64; numBlocks < 320; numBlocks += 64)
    {
        dim3 grid(numBlocks);
        dim3 threads(numThreads);

        /* Dynamic shared memory size in bytes */
        sharedMemorySize = 2 * sizeof(float) * nr;

        aSim<<<grid, threads, sharedMemorySize>>>(d_Rr, d_Tr, d_randNumArray);
    }

If there is a limit to the number of blocks, then how can I calculate it?

Thanks for your help,

Craig

You can run up to 65535 * 65535 blocks. See Appendix A in the programming guide; this is the hardware limit on the grid dimensions.

The occupancy calculator doesn’t take the number of blocks into account because it assumes you have launched enough blocks to saturate all the multiprocessors. Up to 8 blocks can run concurrently on a single multiprocessor (assuming their register and shared memory usage allows it).
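As an aside, if you’d rather not keep the spreadsheet in sync with your kernel by hand, newer CUDA toolkits (6.5 and later) can do the same calculation at runtime. A minimal sketch, with a made-up kernel name:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void simKernel(float *out, int n)
    {
        /* kernel body omitted */
    }

    int main(void)
    {
        int blocksPerSM = 0;

        /* How many 384-thread blocks of simKernel fit on one
           multiprocessor, given its register and shared memory usage
           (0 bytes of dynamic shared memory assumed here)? */
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, simKernel, 384, 0);

        printf("Resident blocks per multiprocessor: %d\n", blocksPerSM);
        return 0;
    }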

On the device, all gridDim.x * gridDim.y blocks start in a waiting queue, and the multiprocessors are filled with running blocks drawn from that queue. As soon as one block completes, its slot is immediately filled with another block from the queue.
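So a single launch covering all your simulations is fine; there is no need for the host-side loop. A minimal sketch (kernel name and sizes made up for illustration) of an oversubscribed launch with a bounds guard for the final, partially filled block:

    #include <cuda_runtime.h>

    /* One thread per simulation; the guard handles the last,
       partially filled block. */
    __global__ void simKernel(float *out, int numSimulations)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < numSimulations)
            out[idx] = (float) idx;   /* placeholder for the real work */
    }

    int main(void)
    {
        const int numSimulations = 1000000;   /* far more than fit at once */
        const int numThreads = 384;
        const int numBlocks = (numSimulations + numThreads - 1) / numThreads;

        float *d_out;
        cudaMalloc(&d_out, numSimulations * sizeof(float));

        /* Single launch; the hardware drains the block queue itself */
        simKernel<<<numBlocks, numThreads>>>(d_out, numSimulations);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }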

Thanks very much!

I think the maximum number of blocks in one grid dimension is 65535. If you use more than 65535 (for example, 65536), the kernel function won’t launch.

You can run 65535 * 65535 blocks, so beyond 65535 you would have to launch a 2D grid. Neither of the two dimensions is allowed to exceed 65535.
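For example, here is a sketch (kernel and variable names made up) of splitting an oversized one-dimensional block count across two grid dimensions and flattening the index back inside the kernel:

    #include <cuda_runtime.h>

    __global__ void bigSim(float *out, int numSimulations)
    {
        /* Recover a linear block index from the 2D grid */
        int block = blockIdx.y * gridDim.x + blockIdx.x;
        int idx = block * blockDim.x + threadIdx.x;
        if (idx < numSimulations)
            out[idx] = 0.0f;          /* placeholder for the real work */
    }

    void launchBigSim(float *d_out, int numSimulations, int numThreads)
    {
        int totalBlocks = (numSimulations + numThreads - 1) / numThreads;

        /* Keep each grid dimension at or below the 65535 hardware limit */
        int gridX = totalBlocks <= 65535 ? totalBlocks : 65535;
        int gridY = (totalBlocks + gridX - 1) / gridX;

        dim3 grid(gridX, gridY);
        bigSim<<<grid, numThreads>>>(d_out, numSimulations);
    }

The grid may launch a few more blocks than needed (gridX * gridY rounds up), which is why the kernel keeps the idx < numSimulations guard.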