How to increase compute utilization if your kernel doesn't have many arithmetic operations?

Okay, so my kernel A reads two values from device memory, XORs them, and stores the result back to device memory.
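__global__ void A(int *dr, int p, int q, int NORi)
{
    // Global thread index; each thread works on its own column of dr.
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N = blockDim.x;  // row stride
    // Per iteration: two loads, one XOR, one store.
    for (int i = 0; i < NORi; i++)
        dr[(i + p) * N + tid] = dr[i * N + tid] ^ dr[(i + q) * N + tid];
}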
The memory access is coalesced. However, there is only one arithmetic operation per iteration. The profiler confirms this: the compute utilization of this kernel is very low.
However, a similar kernel B, which performs a lot of extra computation on those values instead of a single XOR before storing them back to device memory, is faster than kernel A. I understand that if compute utilization is low, the warps spend most of their time waiting for memory accesses to complete instead of computing. But how can kernel B, which has the same number of memory accesses per thread plus the extra computation, finish faster?

Also, I tried adding a lot of synthetic arithmetic operations to kernel A, but the profiler still shows the same compute utilization. What is actually happening there?

This is a very simple case, and you cannot do much. You can use launch bounds to specify the minimum number of blocks you want active per SM (streaming multiprocessor), so that in total you reach the maximum number of threads per SM. For cc 3.x the maximum number of threads per SM is 2048, so you can have 8 blocks of 256 threads per block, 4 of 512, or 2 of 1024. For cc 2.x the maximum is 1536 active threads per SM, which means that with 1024 threads per block you do not achieve maximum occupancy, since only one such block fits. So you have to use 512 or 256 threads per block and have 3 or 6 active blocks per SM, respectively.
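As a rough sketch of the launch-bounds suggestion, the kernel from the question could be annotated as below. The macro names, the grid size, and the values of p, q and NORi are made up for illustration. The row stride N is taken as the total thread count here so that several blocks can run without touching each other's elements, which is an assumption and differs from the question's N = blockDim.x. The 256 x 8 pairing follows the cc 3.x numbers above; on a device with fewer resident threads per SM the second argument would probably need to be lowered.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical choice for this sketch: 256 threads per block and a request
// for at least 8 resident blocks per SM (8 * 256 = 2048 threads, cc 3.x).
#define THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM 8

__global__ void
__launch_bounds__(THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
A(int *dr, int p, int q, int NORi)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N = blockDim.x * gridDim.x;  // assumption: stride = total threads
    for (int i = 0; i < NORi; i++)
        dr[(i + p) * N + tid] = dr[i * N + tid] ^ dr[(i + q) * N + tid];
}

int main()
{
    // Illustrative sizes only; none of these come from the question.
    const int blocks = 16, NORi = 4, p = 2, q = 1;
    const int N = blocks * THREADS_PER_BLOCK;
    const int rows = NORi + (p > q ? p : q);  // highest row index touched + 1

    int *dr = nullptr;
    cudaMalloc(&dr, sizeof(int) * rows * N);
    cudaMemset(dr, 0, sizeof(int) * rows * N);

    // Ask the runtime how many blocks of this kernel fit on one SM.
    int residentBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &residentBlocks, A, THREADS_PER_BLOCK, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("resident threads per SM: %d of %d\n",
           residentBlocks * THREADS_PER_BLOCK,
           prop.maxThreadsPerMultiProcessor);

    A<<<blocks, THREADS_PER_BLOCK>>>(dr, p, q, NORi);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(dr);
    return err == cudaSuccess ? 0 : 1;
}
```

Comparing the two printed thread counts shows whether the chosen block size actually reaches the per-SM limit on the device at hand. Full occupancy only gives the scheduler more warps to switch between while others wait on memory; the kernel still does one XOR per two loads and one store.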