I was running a quick, somewhat arbitrary benchmark to see whether having 2 elements per thread would be faster. I wanted to test the 400-600 cycle global memory access penalty and whether latency masking would change the performance of the CUDA code. I got similar results with the following loop doing silly math.
for( x = 0; x < ARR_SIZE; ++x )
I know, x += 2 would be better, but that's not the point. d_arr is just an array of ints in global memory, and gridThreadId is the position in the array that we're accessing.
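For reference, here is a minimal sketch of the kind of kernel I mean, assuming a structure like the one described above; the post doesn't show the full loop body ("silly math"), so busyKernel, the indexing of the second element, and the += work are placeholders rather than the original code.

// Placeholder kernel: two global-memory updates per loop iteration
// ("2 elements per thread"); the comparison kernel keeps only the first.
__global__ void busyKernel( int *d_arr, int ARR_SIZE )
{
    // flat global thread index, as described in the post
    int gridThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads     = gridDim.x * blockDim.x;

    for( int x = 0; x < ARR_SIZE; ++x )
    {
        d_arr[gridThreadId]            += x;
        d_arr[gridThreadId + nThreads] += x;   // assumes d_arr holds 2 ints per thread
    }
}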
Doing this, I get an average of 19 million clock cycles when calling the kernel 1000 times, both for this version and for the otherwise identical kernel that does only one operation per iteration of the loop.
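The post doesn't say how the cycle counts were taken, so as a rough sketch of the measurement setup here is how I'd average over 1000 launches using CUDA events (which report milliseconds rather than raw clock cycles); it relies on the placeholder busyKernel sketched above.

#include <cstdio>

int main()
{
    const int THREADS  = 256;
    const int BLOCKS   = 16;
    const int ARR_SIZE = 2 * THREADS * BLOCKS;   // two elements per thread

    int *d_arr;
    cudaMalloc( &d_arr, ARR_SIZE * sizeof(int) );
    cudaMemset( d_arr, 0, ARR_SIZE * sizeof(int) );

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    const int RUNS = 1000;
    cudaEventRecord( start );
    for( int i = 0; i < RUNS; ++i )
        busyKernel<<< BLOCKS, THREADS >>>( d_arr, ARR_SIZE );
    cudaEventRecord( stop );
    cudaEventSynchronize( stop );

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "average per launch: %.3f ms\n", ms / RUNS );

    cudaFree( d_arr );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    return 0;
}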
Running the same code on an identical machine under Windows XP with Visual Studio yields only 9 million clock cycles! VS uses compiler optimization level 2, while on the Linux box we used the default and level 3 optimization. I thought the results would be similar, since both runs spend almost all their time on the GPU (aside from passing memory between the host and device).