Okay, so my kernel A reads two values from device memory, computes their exclusive OR, and stores the result back to device memory.

```
__global__ void A(int *dr, int p, int q, int NORi)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N   = blockDim.x * gridDim.x;

    for (int i = 0; i < NORi; i++)
    {
        dr[(i + p) * N + tid] = dr[i * N + tid] ^ dr[(i + q) * N + tid];
    }
}
```

The memory accesses are coalesced. However, there is only one arithmetic operation per element loaded, and the profiler confirms it: the compute utilization of this kernel is very low.
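To make the imbalance concrete, here is a rough arithmetic-intensity estimate (my own numbers, assuming 4-byte `int`s and no cache reuse):

```text
Per element, per loop iteration:
  bytes moved : 2 loads x 4 B + 1 store x 4 B = 12 B
  operations  : 1 XOR
  arithmetic intensity ~ 1 op / 12 B ~ 0.08 ops/byte
```

GPUs generally need on the order of tens of operations per byte before a kernel becomes compute-bound, so kernel A is firmly memory-bound.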

However, a similar kernel B, which performs many extra computations on those values (instead of just one XOR) before storing them back to device memory, is faster than kernel A. I understand that when compute utilization is low, most warps spend their time waiting on memory accesses rather than computing. But how can kernel B, which has the same number of memory accesses per thread plus extra computation, finish everything faster?

Also, I tried adding a lot of synthetic arithmetic operations to kernel A, but the profiler still shows the same compute utilization. What is actually happening there?
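For reference, the synthetic arithmetic I added looked roughly like this (a sketch, not my exact code; `A_padded` and the multiply-add loop are illustrative). I suspect the compiler simply removes extra arithmetic whose result never reaches memory:

```cuda
__global__ void A_padded(int *dr, int p, int q, int NORi)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N   = blockDim.x * gridDim.x;

    for (int i = 0; i < NORi; i++)
    {
        int v = dr[i * N + tid] ^ dr[(i + q) * N + tid];

        // Synthetic extra arithmetic.
        int extra = v;
        for (int k = 1; k < 65; k++)
            extra = extra * k + k;

        // If `extra` is never used, the compiler can dead-code-eliminate
        // the loop above, so the profile looks identical to kernel A.
        // Making the store depend on it keeps the work alive:
        if (extra == 0x7fffffff)   // practically never true
            v ^= 1;

        dr[(i + p) * N + tid] = v;
    }
}
```

Is dead-code elimination the reason the profiler numbers did not move?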