I have been coding my first CUDA program on a GTX 470 and have been using:
#define NUM_THREADS ###
where ### is a number I vary while debugging. I have been able to get the kernel to work correctly with small thread counts (much easier to debug). However, when I increase it to 1024, 512, … or even 64, the kernel does not appear to launch. I even put a printf("hi") right at the top of the kernel. Nothing. If I decrease the thread count, it works again and prints "hi".
I tallied up all my global memory, and even at 1024 threads (the maximum) I am using barely 128 MB of global memory. I do use a lot of registers in the kernel (which I am trying to reduce), but I thought that if the kernel needed too many registers to fill the SM, the hardware would simply schedule fewer threads (e.g. if the limit were 1024 threads per SM, heavy register use might bring that down to 256 threads per SM or so). Is that correct?
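To make the register question concrete, here is a minimal sketch of how the launch could be checked, assuming the CUDA runtime API and a toolkit recent enough to provide `cudaDeviceSynchronize` (older toolkits used `cudaThreadSynchronize`). `myKernel`, `d_out` and `blockFitsRegisters` are hypothetical names for illustration, not the real code. It prints the registers the compiler assigned per thread, the per-block limits the runtime reports, and any error returned by the launch itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 64  // placeholder value; vary it while debugging

// Hypothetical stand-in for the real kernel.
__global__ void myKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Rough estimate only: true when threads * regsPerThread fits in the
// per-block register budget. Registers are allocated in chunks, so the
// real limit can be lower than this product suggests.
static bool blockFitsRegisters(int regsPerThread, int threads, int regsPerBlock)
{
    return (long long)regsPerThread * threads <= (long long)regsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        cudaFuncGetAttributes(&attr, myKernel) != cudaSuccess) {
        printf("could not query device or kernel attributes\n");
        return 1;
    }

    printf("registers per thread: %d, registers per block: %d\n",
           attr.numRegs, prop.regsPerBlock);
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);
    printf("estimate says a %d-thread block fits: %s\n", NUM_THREADS,
           blockFitsRegisters(attr.numRegs, NUM_THREADS, prop.regsPerBlock)
               ? "yes" : "no");

    int *d_out = NULL;
    if (cudaMalloc(&d_out, NUM_THREADS * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    myKernel<<<1, NUM_THREADS>>>(d_out);

    // Errors in the launch configuration (e.g. a block that needs more
    // resources than are available) are reported here, before the kernel runs.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Errors raised while the kernel executes are reported here.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

If `attr.maxThreadsPerBlock` comes back smaller than `NUM_THREADS`, that would suggest the whole block is being rejected at launch rather than scheduled with fewer threads, which would also explain why the printf at the top of the kernel never runs.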
I have also noticed during debugging that if a memory access in the kernel goes out of bounds, the kernel appears to stop (crash?). Is that what happens when a kernel uses an out-of-bounds or unallocated address?
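For the out-of-bounds case, a minimal sketch of the usual guard pattern, assuming a one-dimensional launch. `fill`, `blocksFor` and the sizes are hypothetical and only illustrate the idea: threads whose index falls past the end of the array skip the access, and the host checks the status returned after the kernel finishes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes one element, guarded by n.
__global__ void fill(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // threads past the end of the array do nothing
        data[i] = i;
}

// Number of blocks needed to cover n elements (rounds up).
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1000;       // deliberately not a multiple of the block size
    const int threads = 256;

    int *d_data = NULL;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    fill<<<blocksFor(n, threads), threads>>>(d_data, n);

    // A kernel that touches an invalid address usually surfaces as an
    // error from the next synchronizing call rather than from the launch.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

Without the `if (i < n)` guard, the last block here would write past the end of `d_data`. Running the program under cuda-memcheck, if the installed toolkit includes it, may help pinpoint which access is invalid.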
I am furiously checking all the addresses used in my kernel for validity, but I wanted to ask whether something else might be going on here.