We have some questions about the maximum number of “active” threads. By “active” we do not necessarily mean active in the sense of the Programming Guide, but simply that a thread has been started and has not finished yet.
post_active_blocks.txt (1.6 KB)
The attached file contains two kernels. Each thread of either kernel increments a counter in global memory (using atomicAdd) and then spins in an infinite loop. The loop of the first kernel does nothing but call __syncthreads(), while the loop of the second kernel does some nonsense calculation that reads and writes global memory and then calls __syncthreads(). Asynchronously (using streams), the host reads the counter every 2 seconds and prints it. The value of the counter therefore indicates how many threads have been started so far.
Using a grid of dimension (1000,2) with 256 threads per block (512,000 threads altogether), the output on a GTX280 with the first kernel is 30,720. With the second kernel the output (after some time) is 512,000. When __syncthreads() is removed from the second kernel, the output is 30,720 as well. __threadfence() could probably be used instead of __syncthreads(), but ptxas crashed when we tried __threadfence(). One explanation for the difference between the counter values of the two kernels could be that the global memory accesses in the second kernel give the scheduler more “opportunities” to schedule. However, the question remains: how is it possible that there are so many “active” threads when using the second kernel?
The Programming Guide says that the maximum number of active threads (in the sense of the Programming Guide) per MP is 1024, so on a GTX280 the maximum number of active threads is 30*1024=30,720, which is exactly the value the first kernel hits. The only explanation we can think of for the 512,000 “active” threads with the second kernel is that active blocks/warps are deactivated and others are activated in their place. Is this what the GPU does? If so, where are the register values, the instruction pointer, etc. of the deactivated threads stored?
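To make the gap between the two counter values concrete, here is a small host-side sketch of the arithmetic (plain C++, no GPU needed). The constant and function names are our own illustrative choices, and the resident-block figure assumes that the 1024-threads-per-MP limit is the only limit that applies (register and shared memory usage could lower it further).

```cpp
#include <cstdint>

// Figures quoted in this post for the GTX280.
constexpr std::int64_t kNumMPs          = 30;    // multiprocessors
constexpr std::int64_t kMaxThreadsPerMP = 1024;  // limit from the Programming Guide
constexpr std::int64_t kGridX           = 1000;  // grid dimension x
constexpr std::int64_t kGridY           = 2;     // grid dimension y
constexpr std::int64_t kThreadsPerBlock = 256;

// Threads that can be resident on the whole device at the same time.
constexpr std::int64_t residentThreads() { return kNumMPs * kMaxThreadsPerMP; }

// Threads in the whole grid.
constexpr std::int64_t launchedThreads() { return kGridX * kGridY * kThreadsPerBlock; }

// Blocks resident at the same time, ASSUMING only the thread limit matters.
constexpr std::int64_t residentBlocks() { return residentThreads() / kThreadsPerBlock; }

// Compile-time checks against the numbers in the post.
static_assert(residentThreads() == 30720,  "30 MPs * 1024 threads");
static_assert(launchedThreads() == 512000, "1000 * 2 blocks * 256 threads");
static_assert(residentBlocks()  == 120,    "30720 / 256 threads per block");
```

So under these assumptions only 120 of the 2,000 blocks fit on the device at once, which is why a counter value of 512,000 surprised us.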
The second kernel needs at least 1 register per thread (for the variable data), and that register has to be kept somewhere even while the thread is “deactivated”. Altogether, the values of at least 512,000 registers would have to be stored somewhere, which is more than the 30*16384=491,520 registers available on a GTX280…
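The register argument can be checked with the same kind of host-side sketch (plain C++). The names are illustrative, and one register per thread is only the lower bound we assume; the real kernel may well use more.

```cpp
#include <cstdint>

// Figures quoted in this post for the GTX280.
constexpr std::int64_t kMultiprocessors    = 30;
constexpr std::int64_t kRegistersPerMP     = 16384;
constexpr std::int64_t kTotalThreads       = 512000;  // (1000,2) grid, 256 threads per block
constexpr std::int64_t kRegistersPerThread = 1;       // ASSUMED lower bound (variable "data")

// Registers available on the whole device.
constexpr std::int64_t registersAvailable() { return kMultiprocessors * kRegistersPerMP; }

// Registers needed if every started thread kept its state on chip.
constexpr std::int64_t registersNeeded() { return kTotalThreads * kRegistersPerThread; }

// Number of registers that would not fit.
constexpr std::int64_t registerShortfall() { return registersNeeded() - registersAvailable(); }

// Compile-time checks against the numbers in the post.
static_assert(registersAvailable() == 491520, "30 MPs * 16384 registers");
static_assert(registersNeeded()    == 512000, "one register per thread");
static_assert(registerShortfall()  == 20480,  "does not fit in the register file");
```

Even at one register per thread, 20,480 values would not fit in the register files, so the state of all 512,000 threads cannot be held there at the same time.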