Assume I have a kernel that looks roughly like this:
__global__ void myKernel(unsigned int* upperBounds) {
    // some stuff here
    for (int i = 0; i < upperBounds[threadIdx.x]; i++) {
        // do something here...
    }
    // some more stuff here
}
Imagine that upperBounds has all sorts of values, say, in this order: 4, 12, 6, 4, 23, etc.
How does thread divergence kick in here?
To be more precise, assume that all entries in the upperBounds array are greater than 4. Will there be no thread divergence for the first four trips through the for-loop? Or will there be thread divergence right away, so that, for all practical purposes, the first thread runs its for-loop to completion, then the second thread runs its for-loop, and so on? In other words, will I get 32-way thread divergence right off the bat, or does divergence only kick in after the fourth trip through the for-loop?
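To make the two possibilities concrete, here is a small host-side C++ sketch that counts the loop trips one warp would pay for under each scenario. It is only a model of the question, not device code, and it does not claim which scenario the hardware actually follows; the function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side model (not a CUDA API): `bounds` holds one loop
// bound per thread of a single warp.

// Scenario A: the warp steps through the loop together, and a thread whose
// bound has been reached simply sits idle until the others finish.
// The warp then makes as many trips as its largest bound.
unsigned tripsIfLockstep(const std::vector<unsigned>& bounds) {
    assert(!bounds.empty());
    return *std::max_element(bounds.begin(), bounds.end());
}

// Scenario B: the threads run their loops one after another, so the warp
// pays for every thread's trips separately (the sum of all bounds).
unsigned tripsIfSerialized(const std::vector<unsigned>& bounds) {
    return std::accumulate(bounds.begin(), bounds.end(), 0u);
}

// Under scenario A: how many threads still have work on trip i,
// i.e. the number of threads for which i < bound.
std::size_t activeThreadsOnTrip(const std::vector<unsigned>& bounds, unsigned i) {
    return static_cast<std::size_t>(
        std::count_if(bounds.begin(), bounds.end(),
                      [i](unsigned b) { return i < b; }));
}
```

For the five example bounds 4, 12, 6, 4, 23, scenario A costs 23 trips, with all five threads busy on trips 0 to 3 and only three threads busy on trip 4, while scenario B costs 49 trips.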
I can see that there is no thread divergence if the loop bound is the same for all threads (say N = 50), as in
    for (int i = 0; i < N; i++)
but that is not the case here: the number of trips through the for-loop is thread-specific.
Thanks for your time.