Warp divergence triggered by a for loop

Assume I have a kernel that looks roughly like this:

__global__ void myKernel(unsigned int* upperBounds) {

   // some stuff here

   // the trip count differs per thread; i is unsigned to match upperBounds
   for(unsigned int i=0; i<upperBounds[threadIdx.x]; i++) {
      // do something here...
   }

   // some more stuff here

}

Imagine that upperBounds has all sorts of values, say, in this order: 4, 12, 6, 4, 23, etc.

How does thread divergence kick in here?

To be more precise, assume that all entries in the upperBounds array are greater than 4. Will I have no thread divergence for the first four trips through the for-loop? Or will I have thread divergence right away, so that for all practical purposes the first thread runs its for-loop, then the second thread runs its for-loop, and so on? In other words, will I have a 32-way thread divergence right off the bat, or does the divergence only kick in after the fourth trip through the for-loop?

I can see that there is no thread divergence if the loop bound is the same for every thread, as in (say N=50 for all threads)

for(int i=0; i<N; i++)

but this is not the case here - the number of trips in the for-loop is thread specific.

Thanks for your time.

You will have no “divergence” for the first 4 iterations of the loop, since every thread passes the for-loop conditional test on those iterations. It is not a 32-way serialization: the threads do not run their loops one after another. From the fifth iteration on, what happens in each warp depends on which of its threads still pass the conditional test on that iteration. The threads that pass execute the loop body together in lock-step, while the threads that have already finished the loop sit inactive. The warp therefore stays in the loop for as many iterations as the largest upperBounds entry among its threads.
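To make the lock-step behavior concrete, here is a minimal host-side sketch that only models the loop condition for the threads of one warp. It does not run on the GPU or measure anything. The helper name activeLanesPerIteration is hypothetical, and the model assumes that all listed bounds belong to threads of the same warp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side model only (hypothetical helper, not a CUDA API).
// Given the per-thread loop bounds of ONE warp, return for each loop
// iteration i how many threads still satisfy i < bound, i.e. how many
// lanes would be active in the loop body on that iteration.
std::vector<std::size_t> activeLanesPerIteration(
        const std::vector<unsigned int>& upperBounds) {
    // The warp stays in the loop until its longest-running thread is done.
    unsigned int trips = upperBounds.empty()
        ? 0u
        : *std::max_element(upperBounds.begin(), upperBounds.end());

    std::vector<std::size_t> active(trips, 0);
    for (unsigned int i = 0; i < trips; ++i) {
        for (unsigned int bound : upperBounds) {
            if (i < bound) {
                ++active[i];  // this thread passes the loop test on iteration i
            }
        }
    }
    return active;
}
```

For the bounds 4, 12, 6, 4, 23 from the question, the model returns 23 entries: all 5 threads are active on iterations 0 to 3, 3 threads on iterations 4 and 5, 2 threads on iterations 6 to 11, and only the last thread on iterations 12 to 22. The cost of the loop for the warp is set by the largest bound, not by the sum of the bounds.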

Awesome - thanks for clarifying this…