I was doing a simple test to see how many concurrent thread blocks can run on a Tesla C1060.
The kernel is very straightforward: after launching, thread (0,0,0) of each thread block sets a flag; block (0,0) then checks whether all flags are set, and if so the kernel finishes.
Since the C1060 has 30 SMs, and each can support 1024 active threads, I assumed that I can have 120 concurrent thread blocks of 256 threads each. But if I run with this execution configuration, i.e., <<<120,256>>>, the kernel hangs forever: the flags for blocks (90,0) to (119,0) are never set. However, if I run with <<<120,192>>>, the kernel finishes smoothly. But in this case, the number of concurrent active threads per SM is 192 x 4 = 768, which is only 75% of 1024, the maximum number of active threads supported. Strangely enough, if I run with <<<240,128>>>, the kernel exits correctly, which suggests the device can run a kernel that uses the full hardware resources :wacko:
1024 threads can be queued/allocated per MP. Only 32 of them (a warp) are executing at any given time, so physically the maximum number of threads executing during any 4 cycles is 30 x 32 = 960. The 1024 threads (32 warps) are shuffled and timesliced; you don't know their order of execution. They are a pool of ready threads from which the MP can select a warp that is not waiting on a memory fetch or a __syncthreads().
Thanks for the reply. I should define "active" in my problem: by "active" I mean a thread that has started but has not finished its execution, although it may be swapped out while waiting for a memory load or synchronization. I wanted to know how many such blocks/threads can be running concurrently. In my case, it seems <<<120,256>>> is not possible, while <<<90,256>>> and <<<240,128>>> are okay.
It is pitch black. You are likely to be eaten by a grue.
Seriously, the answer is “it depends on a whole lot of things in really obscure ways.” It is not easy to figure out, and there is no good way to even use this knowledge right now.
Sorry, I don't have the code right now. It is actually very simple; you can write it in 10 minutes. Basically, you cudaMalloc() a chunk of memory with one element per thread block, then pass its address to the kernel. In the kernel, thread (0,0,0) of each thread block sets its own location in that memory; block (0,0) then checks whether all locations are set, and if so, block (0,0) sets a "go" flag to notify all thread blocks to exit.
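A minimal sketch of what that description amounts to (my reconstruction, not the original code — the kernel and parameter names are made up; `flags` and `go` are assumed to be zero-initialized with cudaMemset before launch, and `volatile` is needed so the spinning reads are not cached in registers):

```cuda
// Hypothetical reconstruction of the test kernel described above.
// flags: one int per thread block, all zero at launch.
// go:    a single int, zero at launch.
__global__ void block_count_test(volatile int *flags, volatile int *go)
{
    if (threadIdx.x == 0)
        flags[blockIdx.x] = 1;        // announce that this block is running

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Block (0,0) spins until every block has set its flag.
        // If some block never becomes resident, this loop never exits.
        for (int b = 0; b < gridDim.x; b++)
            while (flags[b] == 0)
                ;
        *go = 1;                      // release all blocks
    }

    while (*go == 0)                  // every block waits for the "go" flag
        ;
}
```

Launched as, e.g., `block_count_test<<<120, 256>>>(d_flags, d_go)`. The point of the test is exactly the failure mode visible in the code: block (0,0) only returns if all gridDim.x blocks run concurrently, so the kernel deadlocks whenever the grid does not fit on the device at once.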