I have a question about the limits on the number of thread blocks.
When you launch a __global__ function, you need to specify the number of blocks and the number of threads per block. I’m designing my program so that it can run an unknown number of simulations, each carried out by one thread. Since there is a limited amount of memory, I need to break the simulations up into manageable chunks. After some trial and error, I arrived at the following figures from the occupancy calculator:
I can see that this spreadsheet makes no mention of the number of blocks when you enter the resource usage. Why doesn’t the spreadsheet take that into account? Does the scheduler on the GPU queue up the blocks of a kernel invocation somehow? Say 64 blocks fit on the GPU at once and each block contains 384 threads (24,576 simulations in total). If I launch 320 blocks, will the GPU run them as 5 sets of 64 blocks? In other words, if the number of blocks is larger than the available resources can hold at once, does the GPU manage this itself?
What I’ve been doing is choosing a number of threads based on the occupancy calculator and plugging that into my program. Then I calculate the number of blocks that will be required based on the number of simulations I need to run. Here’s the relevant code:
/* Threads per block, based on the CUDA occupancy calculator,
which allows for 50% occupancy */
const int numThreads = 384;
int numSimulations;
int numBlocks;
readFile (fp, &numSimulations);
/* One thread per simulation: round the block count up.
Integer ceiling division avoids float rounding for large counts. */
numBlocks = (numSimulations + numThreads - 1) / numThreads;
/* Device Grid and Block Size */
dim3 grid (numBlocks);
dim3 threads (numThreads);
/* Allocate shared memory size */
sharedMemorySize = 2 * sizeof(float) * nr;
photonSim<<< grid, threads, sharedMemorySize>>> (d_Rr, d_Tr, d_randNumArray);
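To make the rounding concrete, here is a small host-side sketch of the sizing arithmetic in plain C++ (no CUDA calls, so it can be checked on the CPU). The names GridPlan and planGrid are made up for this illustration and are not part of my program. It uses integer ceiling division, because a float only has a 24-bit significand and cannot represent every simulation count above 16,777,216 exactly. It also reports how many threads in the last block have no simulation to run, since I assume the kernel needs an index check so that those threads exit early.

```cpp
#include <cassert>

// Hypothetical helper type for this sketch: result of sizing a 1-D grid.
struct GridPlan {
    int numBlocks;    // blocks needed to cover every simulation
    int idleThreads;  // threads in the last block with no simulation
};

// One thread per simulation; round the block count up with integer math.
GridPlan planGrid(int numSimulations, int threadsPerBlock) {
    assert(numSimulations >= 0 && threadsPerBlock > 0);
    int numBlocks = (numSimulations + threadsPerBlock - 1) / threadsPerBlock;
    int idleThreads = numBlocks * threadsPerBlock - numSimulations;
    return GridPlan{numBlocks, idleThreads};
}
```

With 384 threads per block, 24,576 simulations fit exactly into 64 blocks, while 25,000 simulations need 66 blocks and leave 344 threads in the last block with nothing to do.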
I guess I just need a sanity check: does this work? Can you have an unlimited number of blocks but only a limited number of threads per block?
If not, should I set up a loop with something like this?
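for (int firstBlock = 0; firstBlock < 320; firstBlock += 64)
{
    /* 64 blocks per launch; 5 launches cover 320 blocks */
    dim3 grid (64);
    dim3 threads (numThreads);
    /* Dynamic shared memory per block, in bytes */
    sharedMemorySize = 2 * sizeof(float) * nr;
    /* The kernel would also need firstBlock to know which simulations to run */
    photonSim<<< grid, threads, sharedMemorySize>>> (d_Rr, d_Tr, d_randNumArray);
}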
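For the looped version, this is roughly how I picture splitting the work, written as a host-side C++ sketch with no CUDA calls. Batch, planBatches and the 64-block batch size are my own placeholder names and numbers. Each batch would become one kernel launch, and I assume the kernel would need firstBlock passed in so that its threads can work out which simulations they own.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one kernel launch in the batched scheme.
struct Batch {
    int firstBlock;  // index of the first block covered by this launch
    int blockCount;  // grid size for this launch
};

// Split totalBlocks into launches of at most blocksPerLaunch blocks each.
std::vector<Batch> planBatches(int totalBlocks, int blocksPerLaunch) {
    assert(totalBlocks >= 0 && blocksPerLaunch > 0);
    std::vector<Batch> batches;
    for (int first = 0; first < totalBlocks; first += blocksPerLaunch) {
        // The last launch may be smaller than the others.
        int count = std::min(blocksPerLaunch, totalBlocks - first);
        batches.push_back(Batch{first, count});
    }
    return batches;
}
```

For 320 blocks in batches of 64 this gives five launches of 64 blocks each; for 66 blocks it gives one launch of 64 blocks and one of 2.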
If there is a limit to the number of blocks, then how can I calculate it?
Thanks for your help,
Craig