Number of active threads

Hi,

We have some questions about the maximum number of “active” threads (not necessarily active in the sense of the Programming Guide, but active in the sense that a thread has been started and has not yet finished).

post_active_blocks.txt (1.6 KB)

Attached is a file containing two kernels, both of which increment a counter in global memory (using atomicAdd) and then run an infinite loop. The first kernel does nothing but a __syncthreads() in this loop, while the second also does some nonsense calculation that reads and writes global memory before the __syncthreads(). Asynchronously (using streams), the host reads the value of the counter every 2 seconds and prints it, so the counter value indicates the number of threads that have been started so far.
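In rough outline (the actual code is in the attachment; the exact names and the initialisation of the variable data here are assumptions), the two kernels look like this:

__global__ void kernel1(int *counter)
{
    atomicAdd(counter, 1);   // count this thread as "started"
    while (true)
        __syncthreads();     // nothing else happens in the loop
}

__global__ void kernel2(int *counter, float *tmp)
{
    int data = threadIdx.x;  // the per-thread variable mentioned further down
    atomicAdd(counter, 1);
    while (true) {
        // nonsense calculation reading and writing global memory
        tmp[threadIdx.x] /= tmp[(threadIdx.x + data) % blockDim.x] + data;
        __syncthreads();
    }
}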

Using a grid of dimension (1000,2) and 256 threads per block (512,000 threads altogether), the output on a GTX280 with the first kernel is 30,720. With the second kernel the output (after some time) is 512,000. (When __syncthreads() is not used, the output is 30,720 as well; probably __threadfence() could be used instead of __syncthreads(), but ptxas crashed when we tried __threadfence().) One explanation for the difference between the counter values of the two kernels could be that in the second kernel the access to global memory gives the scheduler more “opportunities” to schedule. Still, the question remains: how can there be so many “active” threads when using the second kernel?
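For completeness, the host-side polling looks roughly like this (again only a sketch; it assumes pinned memory from cudaMallocHost and that the kernel was launched in its own non-default stream, so the copy is not serialized behind it):

#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

// Read the counter every 2 seconds while the kernel keeps running.
void poll_counter(int *device_counter, int *host_counter /* pinned */)
{
    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);
    for (;;) {
        sleep(2);
        cudaMemcpyAsync(host_counter, device_counter, sizeof(int),
                        cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);   // wait for this copy only
        printf("threads started so far: %d\n", *host_counter);
    }
}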

The Programming Guide says that the maximum number of active threads (in its sense) per MP is 1024, so on a GTX280 the maximum number of active threads is 30*1024=30,720 (which is exactly the value the first kernel hits). The only explanation we can think of for the 512,000 “active” threads with the second kernel is that active blocks/warps are deactivated and others are activated. Is this what the GPU does? If so, where are the values of the registers, the instruction pointer, etc. of the deactivated threads put?

The second kernel needs at least one register per thread (for the variable data) to be kept somewhere even if the thread is “deactivated”. So altogether the values of 512,000 registers have to be stored somewhere, which is more than the 30*16384=491,520 registers available on a GTX280…

Why don’t you just use cudaMemset instead of:
cudaMallocHost((void**)&new_counter, sizeof(int));
*new_counter = 0;
cudaMemcpy(device_counter, new_counter, sizeof(int), cudaMemcpyHostToDevice);
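Something like this (assuming device_counter comes from cudaMalloc) does the same initialisation in a single call:

cudaMemset(device_counter, 0, sizeof(int));   // zero the counter directly on the device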

Couldn’t this line:
tmp[threadIdx.x] /= tmp[(threadIdx.x + data) % blockDim.x] + data;
result in a divide by zero, since you haven’t initialised tmp to be non-zero?
I know you mightn’t care about the result of the calculation, but wouldn’t this line also result in a race condition? Thread X of every block will be attempting to write to the same location. You mentioned using __threadfence() instead of __syncthreads() here, but I wouldn’t be surprised if doing that on top of this race condition (if I understood it correctly) caused unexpected behaviour.
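For what it’s worth, one way to avoid both issues could look roughly like this (just an illustration, not your actual code; I’m assuming tmp holds one float per launched thread, with n being the total thread count):

__global__ void kernel2_fixed(int *counter, float *tmp, int n)
{
    // Unique index across the whole (2-D) grid, so no two threads ever
    // write the same element of tmp.
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int tid   = block * blockDim.x + threadIdx.x;
    int data  = threadIdx.x + 1;   // per-thread value kept live in a register

    atomicAdd(counter, 1);
    while (true) {
        // The denominator is always >= 1, so there is no division by zero;
        // reading other threads' elements may return stale values, which is
        // harmless for a dummy calculation.
        float denom = fabsf(tmp[(tid + data) % n]) + 1.0f;
        tmp[tid] = tmp[tid] / denom + data;
        __syncthreads();
    }
}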

When all of the registers have been used up, local memory is probably being used instead. It is quite surprising that the GPU apparently deactivates threads before they have finished, and that calling __syncthreads() should prevent this from happening.

Copying memory when it’s in use is so ridiculously undefined…

Thanks for your replies.

@beachhut
You’re right, using cudaMemset would have been better ;-)
Division by zero and/or race conditions could be a reason for our results; maybe they cause threads to crash and thus others to start. But we modified our kernel to avoid division by zero and the race conditions caused by threads accessing the same address in global memory, and the results remain the same… Maybe there are other side effects that cause threads to crash? And why do we get different results when we’re not using __syncthreads()?

post_active_blocks.txt (2.1 KB)

@tmurray
Could that be an explanation for our results? One would expect that copying memory (that is in use) from device to host is just some kind of read access, and that the only “bad” thing that could happen is that “wrong” values are copied (because a write operation is in progress). In addition, if we let the host program sleep long enough, the very first memory access already reads the final counter value, so the first read access (from the host) probably occurs after the last write access (from the device).
Or are there other harmful things that could happen? Threads crashing? Some kind of “Heisenberg uncertainty principle”? ;-)