I ran an experiment to see how an SM executes two instructions. I use a vector add as the test, with 512 threads in total. In the one-instruction case, all 512 threads do the same work. In the two-instruction case, the first 256 threads do the first piece of work and the other 256 threads do the second.
The code for the one-instruction case:
while (id < n) {
    for (k = 0; k < 18000; k++)
        c[id] = a[id] + b[id];
    id += gridDim.x*blockDim.x;
}
The code for the two-instruction case is as follows:
while (id < n) {
    for (k = 0; k < 18000; k++)
    {
        if (id < 256) {
            c[id] = a[id] + b[id];   // first 256 threads write to c
        }
        else {
            u[id] = a[id] + b[id];   // remaining threads write to u
        }
    }
    id += gridDim.x*blockDim.x;
}
As we can see, in the two-instruction case both instructions perform the same operation; only the output array differs (c versus u).
Then I used the CUDA profiler to measure the kernel execution time:
the two-instruction case takes 13 ms and the one-instruction case takes 11.8 ms.
I want to know why the one-instruction case takes less time. I assumed the two cases would have the same execution time, because in the two-instruction case both instructions are the same.
I am quite confused. Could anybody give me some help?
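In case it helps anyone reproduce this, here is a minimal sketch of the whole test as a standalone program. Several things in it are assumptions and not taken from my original code: the kernel names add_one and add_two, the helper run_experiment, the float element type, n = 512, and the launch configuration of 2 blocks of 256 threads. It also times the kernels with CUDA events instead of the profiler, so the absolute numbers may not match the ones above. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 18000                 // loop count used in the post
static const int N = 512;           // assumption: one element per thread
static const int THREADS = 256;     // assumption: 2 blocks x 256 threads = 512
static const int BLOCKS = 2;

// One-instruction case: every thread writes to c.
__global__ void add_one(const float *a, const float *b, float *c, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    while (id < n) {
        for (int k = 0; k < ITERS; k++)
            c[id] = a[id] + b[id];
        id += gridDim.x * blockDim.x;
    }
}

// Two-instruction case: ids below 256 write to c, the rest write to u.
__global__ void add_two(const float *a, const float *b, float *c, float *u, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    while (id < n) {
        for (int k = 0; k < ITERS; k++) {
            if (id < 256)
                c[id] = a[id] + b[id];
            else
                u[id] = a[id] + b[id];
        }
        id += gridDim.x * blockDim.x;
    }
}

// Runs both kernels, stores their elapsed times in milliseconds, and returns
// the number of wrong results found in c (all ids) and u (ids >= 256).
int run_experiment(float *ms_one, float *ms_two)
{
    const size_t bytes = N * sizeof(float);
    float h_a[N], h_b[N], h_c[N], h_u[N];
    for (int i = 0; i < N; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *a, *b, *c, *u;
    cudaMalloc((void **)&a, bytes);
    cudaMalloc((void **)&b, bytes);
    cudaMalloc((void **)&c, bytes);
    cudaMalloc((void **)&u, bytes);
    cudaMemcpy(a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not included in the timing.
    add_one<<<BLOCKS, THREADS>>>(a, b, c, N);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    add_one<<<BLOCKS, THREADS>>>(a, b, c, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms_one, start, stop);

    cudaEventRecord(start);
    add_two<<<BLOCKS, THREADS>>>(a, b, c, u, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms_two, start, stop);

    cudaMemcpy(h_c, c, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_u, u, bytes, cudaMemcpyDeviceToHost);

    int wrong = 0;
    for (int i = 0; i < N; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) wrong++;              // c written for all ids
        if (i >= 256 && h_u[i] != h_a[i] + h_b[i]) wrong++;  // u written for ids >= 256
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(u);
    return wrong;
}

int main()
{
    float ms_one = 0.0f, ms_two = 0.0f;
    int wrong = run_experiment(&ms_one, &ms_two);
    printf("one instruction:  %.3f ms\n", ms_one);
    printf("two instructions: %.3f ms\n", ms_two);
    printf("wrong results:    %d\n", wrong);
    return 0;
}
```

It should build with something like nvcc -o test test.cu (the file name is arbitrary). With this configuration the split at id 256 falls on a block boundary, so every thread in a given block takes the same branch. A single timed launch per kernel is noisy, so it is probably worth repeating each measurement several times before comparing the two cases.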