Large thread count prevents kernel from running

I have been coding my first CUDA program on a GTX 470 and have been using

#define NUM_THREADS ###

Where ### is a number I vary while debugging. I have been able to get the kernel to work correctly using small thread counts (much easier to debug). However, as I increase it to 1024, or 512, … or 64, the kernel does not appear to launch at all. I even put a printf("hi!") right at the top of the kernel. Nothing. If I decrease the size, it works again and prints "hi!". I tallied up all my global memory usage, and even at 1024 threads (the max) I am barely using 128 MB of global memory. I do use a lot of registers in the kernel (which I am trying to reduce), but I thought that if there were too many registers to fill the SM, it would just schedule fewer threads (i.e. the max threads per SM is 1024, but because lots of registers are used it would only run 256 threads per SM, or something like that). Is that correct?

I have also noticed during debugging that if an address in the kernel goes out of bounds, the kernel appears to stop (crash?). Is that what happens when an out-of-bounds or unallocated address is dereferenced in a kernel?

I am furiously checking all my kernel addresses for validity, but I just wanted to see if there was something else going on here.



The number of blocks per SM is indeed automatically reduced if there aren’t enough registers available. However, once the number of blocks per SM drops below one, the kernel cannot be launched anymore.
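As a sketch of how to check this programmatically (the kernel name `someKernel` is a stand-in for your own), `cudaFuncGetAttributes` reports the per-thread register count and the largest block size the kernel can actually launch with, given its register and shared-memory usage; `nvcc --ptxas-options=-v` prints the same register count at compile time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel() { /* stand-in for your real kernel */ }

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, someKernel);
    // numRegs: registers used per thread.
    // maxThreadsPerBlock: largest block size this particular kernel can
    // launch with; launching more threads per block than this fails.
    printf("regs/thread = %d, max threads/block = %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);
    return 0;
}
```

If the block size you request exceeds the reported `maxThreadsPerBlock`, the launch fails exactly as described above.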

Also, you probably want to get into the habit of checking the return codes on CUDA functions, which will tell you directly if a kernel has failed to launch.
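A minimal sketch of that habit (the `CUDA_CHECK` macro name is my own): kernel launches don't return an error directly, so you follow the launch with `cudaGetLastError()`, and optionally a synchronize to catch errors that occur during execution:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report any CUDA error with file/line context.
#define CUDA_CHECK(err) do {                                    \
        cudaError_t e = (err);                                  \
        if (e != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",        \
                    __FILE__, __LINE__, cudaGetErrorString(e)); \
        }                                                       \
    } while (0)

__global__ void someKernel() { }

int main() {
    someKernel<<<1, 2048>>>();           // 2048 threads/block: too large
    CUDA_CHECK(cudaGetLastError());      // reports the failed launch
    CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during execution
    return 0;
}
```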

Thanks for all the help! This is my first program and there is a TON to learn.

I am writing a genetic algorithm, which means I need to synchronize the population members each iteration so I can sort them and then do the appropriate operations on them. My original idea was to put the members in a single block (max size = 1024, which is fine) and have them collaborate using shared memory and __syncthreads(). It now appears that I can't get the register usage down to a manageable level, so that idea isn't working out so great. I would like to use multiple blocks, but I am unsure how to sync them every iteration so I can sort the data set.

I believe kernel invocation is very lightweight. Is this an appropriate way to synchronize blocks? Is there a better way?

do {

    foreach (block) {
        someKernel<<<dimGrid,dimBlock>>>();  // processes each block's piece of the data out of a BIG global memory chunk
    }

    sync(); // If I remember correctly the CPU keeps executing until it sees a blocking function like cudaMemcpy()

    anotherKernel<<<dimGrid2,dimBlock2>>>(); // sorts the global chunk of memory

} while (data not converged);

Is there a way to sync blocks in Compute 2.x? I know there isn’t in Compute 1.x.

On that subject, I have the book by Hwu & Kirk and have read some of the other class notes and online resources. Many of them assume a Compute 1.x device and describe techniques for dealing with Compute 1.x limitations (like the lack of a cache). Is there a set of Compute 2.x resources that details the upgrades in Compute 2.x and how best to take advantage of them?



Multiple kernel invocations are indeed the way to go, as there is no inter-block synchronization.
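A minimal sketch of that pattern (kernel names, grid sizes, and the fixed iteration count are placeholders): consecutive launches on the same stream execute in order, so the second kernel only starts after every block of the first has finished and its global-memory writes are visible. Each launch boundary therefore acts as a grid-wide barrier:

```cuda
#include <cuda_runtime.h>

__global__ void stepKernel(float *data, int n) { /* one GA iteration */ }
__global__ void sortKernel(float *data, int n) { /* sort the population */ }

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    for (int iter = 0; iter < 100; ++iter) {  // convergence test elided
        // Default-stream launches are serialized by the device: sortKernel
        // begins only after all blocks of stepKernel complete, giving an
        // implicit grid-wide synchronization point between the two.
        stepKernel<<<4096, 256>>>(d_data, n);
        sortKernel<<<4096, 256>>>(d_data, n);
    }
    cudaDeviceSynchronize();  // host waits for the final iteration
    cudaFree(d_data);
    return 0;
}
```

Note that the host does not need to block between the two launches; the ordering guarantee comes from the stream, not from a host-side sync.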

I notice you have

foreach(block) {

    someKernel<<<dimGrid,dimBlock>>>(); // processes each block's piece of the data out of a BIG global memory chunk


If by "block" above you mean a CUDA block of threads, then you can launch them all at once, without the foreach, and it will run faster.

True, but the data size is not constant. Since there is a limit on the size a block can realistically handle, I think the code below is appropriate. Will this code execute efficiently?

int i;

for (i = 0; i < NUM_BLOCKS; i++) {
    someKernel<<<dimGrid,dimBlock>>>(memory_offset_for_block_i, pointer_to_big_memory_chunk); // processes each block's piece of the data out of a BIG global memory chunk
}

dimGrid should be the number of blocks you want to launch.
You can calculate 'memory_offset_for_block_i' within the kernel from blockIdx, so the loop isn't needed.
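A sketch of that suggestion (the kernel body, block size, and element counts are illustrative placeholders): each block derives its own offset into the big chunk from `blockIdx`, so a single launch covers the whole data set with no host-side loop:

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed block size for this sketch

// Each block computes its own slice of the big chunk from blockIdx,
// replacing the host-side for loop with one grid-wide launch.
__global__ void someKernel(float *bigChunk, int elemsPerBlock) {
    int offset = blockIdx.x * elemsPerBlock;  // this block's starting offset
    for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x) {
        bigChunk[offset + i] *= 2.0f;         // stand-in for the real work
    }
}

int main() {
    const int numBlocks = 1024, elemsPerBlock = 256;
    float *d_chunk;
    cudaMalloc(&d_chunk, numBlocks * elemsPerBlock * sizeof(float));
    // One launch processes every block's piece at once, keeping all SMs busy.
    someKernel<<<numBlocks, THREADS_PER_BLOCK>>>(d_chunk, elemsPerBlock);
    cudaDeviceSynchronize();
    cudaFree(d_chunk);
    return 0;
}
```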

PS: if you launch only one block at a time, then only one SM is doing anything and all the others will be idle. So if your GPU has, say, 30 SMs, you would only be getting about 3% of the speed you might get!
GPUs love to have dozens or even tens of thousands of blocks, and often a smaller block size will work better.

PS: I do use a loop sometimes, but to split ~100,000 blocks into about 10 sets, so that each set finishes before the kernel watchdog time limit is hit (5 seconds).

I am aware that I can launch 65,535 × 65,535 grids of blocks, and that for max efficiency I need to keep the 14 SMs in my GTX 470 loaded with about 6 warps each at all times.

There are some details of the program I omitted for brevity that necessitate the for loop. Each iteration of the for loop will launch thousands to millions of blocks.