I was doing a simple test to see how many concurrent thread blocks can run on a Tesla C1060.
The kernel is very straightforward: after launch, thread (0,0,0) of each thread block sets a flag, and block (0,0) then checks whether all flags are set. If they are, the kernel finishes.
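A minimal sketch of a kernel along these lines, assuming a 1D grid and 1D blocks. The names `residencyTest` and `runResidencyTest` are hypothetical, and `cudaDeviceSynchronize` assumes a reasonably recent toolkit, so treat this as an illustration of the idea rather than the exact code used in the test:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch only: residencyTest and runResidencyTest are hypothetical names.
__global__ void residencyTest(volatile int *flags, int numBlocks)
{
    // Thread 0 of each block marks its block as started.
    if (threadIdx.x == 0) {
        flags[blockIdx.x] = 1;
        __threadfence();  // make the write visible to other blocks
    }

    // Thread 0 of block 0 spins until every block has reported in.
    // If some block can never become resident, this loop never exits.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int b = 0; b < numBlocks; ++b) {
            while (flags[b] == 0) { /* busy-wait */ }
        }
    }
}

// Launches the test and returns true if the kernel completed without error.
// Warning: this call hangs if the whole grid cannot be resident at once.
bool runResidencyTest(int numBlocks, int threadsPerBlock)
{
    int *flags = nullptr;
    if (cudaMalloc(&flags, numBlocks * sizeof(int)) != cudaSuccess) {
        return false;
    }
    cudaMemset(flags, 0, numBlocks * sizeof(int));

    residencyTest<<<numBlocks, threadsPerBlock>>>(flags, numBlocks);
    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(flags);
    return err == cudaSuccess;
}
```

Calling `runResidencyTest(120, 192)` would correspond to a `<<<120,192>>>` launch. The flags are declared `volatile` so that the spinning thread re-reads global memory on every iteration instead of keeping the value in a register.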
Since the C1060 has 30 SMs, and each can support 1024 active threads, I assumed that I could have 120 concurrent thread blocks of 256 threads each, i.e., 4 blocks per SM. But if I run with this execution configuration, <<<120,256>>>, the kernel hangs forever: the flags for blocks (90,0) to (119,0) are never set. However, if I run with <<<120,192>>>, the kernel finishes smoothly. In that case the number of concurrent active threads per SM is 192 x 4 = 768, which is only 75% of 1024, the maximum number of active threads supported. Strangely enough, if I run with <<<240,128>>>, the kernel also exits correctly, which means the device can run a kernel that uses the full 1024 threads per SM (8 blocks x 128 threads).
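To make the arithmetic for the three configurations explicit, here is a small host-only sketch. It assumes blocks are spread evenly over the 30 SMs and models only the 1024-threads-per-SM limit quoted above; it deliberately ignores any other per-SM limits (registers, shared memory), so it cannot by itself explain the hang. The helper names are hypothetical:

```cuda
// Figures taken from the post; only the per-SM thread limit is modelled.
const int kNumSMs = 30;
const int kMaxThreadsPerSM = 1024;

// Blocks each SM must hold if the whole grid is to be resident at once,
// assuming blocks are distributed evenly (rounded up).
int blocksPerSMNeeded(int gridBlocks)
{
    return (gridBlocks + kNumSMs - 1) / kNumSMs;
}

// Threads each SM must hold for the whole grid to be resident at once.
int threadsPerSMNeeded(int gridBlocks, int threadsPerBlock)
{
    return blocksPerSMNeeded(gridBlocks) * threadsPerBlock;
}

// True if the grid fits when the thread count is the only constraint.
bool fitsByThreadCount(int gridBlocks, int threadsPerBlock)
{
    return threadsPerSMNeeded(gridBlocks, threadsPerBlock) <= kMaxThreadsPerSM;
}
```

By this count, <<<120,256>>> needs 4 x 256 = 1024 threads per SM, <<<120,192>>> needs 4 x 192 = 768, and <<<240,128>>> needs 8 x 128 = 1024, so all three fit on paper. If blocks 0 to 89 did set their flags in the <<<120,256>>> run, that would correspond to only 3 resident blocks per SM (3 x 256 = 768 threads), which suggests something other than the thread limit is capping residency in that case.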