As a newbie to CUDA programming, I find the concept of SIMT (single instruction, multiple threads) quite confusing. As I understand it, CUDA code gets the best performance when all threads execute exactly the same instructions. But what happens when different threads have to execute a different number of instructions? For example, in the following kernel:
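__global__ void kernel(int *n, int *a) {  // "kernel" is a placeholder name
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread runs its own number of iterations, given by n[tid]
    for (int i = 0; i < n[tid]; i++) {
        a[tid] += i;
    }
}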
Case 1) the array n is filled with values varying from 10 to 1000.
Case 2) every element of n is exactly 1000.
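For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two cases could be set up and timed. Several things in it are assumptions for illustration and not part of the kernel above: the names `accumulate` and `timeKernel`, the extra `count` parameter with its bounds check, the array size, the block size of 256, and the choice to fill case 1 with independent random values per thread. Error checking is omitted to keep it short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Same loop as in the question. The name and the bounds check are assumptions.
__global__ void accumulate(const int *n, int *a, int count) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= count) return;
    for (int i = 0; i < n[tid]; i++) {
        a[tid] += i;
    }
}

// Hypothetical helper: copies the trip counts to the device, runs the kernel
// once as a warm-up, then times a second launch with CUDA events.
static float timeKernel(const std::vector<int> &hostN) {
    const int count = static_cast<int>(hostN.size());
    int *dN = nullptr;
    int *dA = nullptr;
    cudaMalloc(&dN, count * sizeof(int));
    cudaMalloc(&dA, count * sizeof(int));
    cudaMemcpy(dN, hostN.data(), count * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dA, 0, count * sizeof(int));

    const int threads = 256;  // assumed block size
    const int blocks = (count + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not measured.
    accumulate<<<blocks, threads>>>(dN, dA, count);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    accumulate<<<blocks, threads>>>(dN, dA, count);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dN);
    cudaFree(dA);
    return ms;
}

int main() {
    const int count = 1 << 20;  // assumed number of threads/elements

    // Case 1: per-thread trip counts in [10, 1000], chosen at random (assumption).
    std::vector<int> varying(count);
    for (int i = 0; i < count; i++) {
        varying[i] = 10 + rand() % 991;
    }

    // Case 2: every thread runs exactly 1000 iterations.
    std::vector<int> uniform(count, 1000);

    printf("case 1 (varying): %.3f ms\n", timeKernel(varying));
    printf("case 2 (uniform): %.3f ms\n", timeKernel(uniform));
    return 0;
}
```

Assuming the file is saved as `simt_test.cu`, it should build with `nvcc simt_test.cu -o simt_test`. How the values of case 1 are distributed across neighboring threads is a choice of this sketch, so a different layout of n may give different timings.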
I expected case 1) to violate the SIMT rule and perform worse. However, my tests showed very similar performance for case 1) and case 2).
This is very confusing. Does anybody know why I get such similar performance in the two cases? Thanks!