a question about SIMT

As a newbie to CUDA programming, the concept of SIMT (single instruction, multiple threads) is quite confusing to me. To my understanding, CUDA code gets the best performance when all threads execute exactly the same instructions. But what about the case where different threads have to execute different numbers of instructions? For example, consider the following kernel:

__global__ void kernel(int *n, int *a){
    const int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (int i = 0; i < n[tid]; i++){
        a[tid] += i;
    }
}
Case 1) the array n is filled with varying numbers from 10 to 1000.
Case 2) every element of n is exactly 1000.

I expected case 1) to violate the SIMT rule and achieve worse performance. However, my tests gave very similar performance for case 1) and case 2).

It is very confusing. Does anybody know why I get such similar performance for the two cases? Thanks!

Your expectation for case 1 is correct in principle. However, there are 8 compute units per SM, and they execute threads in groups of 32 in SIMD fashion. Those groups of 32 threads are called warps. If your block contains more threads (and usually it should), you will have several warps interleaving on those “poor” 8 units, and two different warps may execute completely different instructions without any penalty. Divergence only costs you within a single warp, where the loop runs for as long as the slowest thread needs. That’s why in your case the performance degrades, but not by much.