Simple element-wise addition: more stores than reads?


I’ve got a really simple kernel that adds each element of one vector to the corresponding element of a second vector and stores the result back in the second.

__global__ void add(cufftComplex *src, cufftComplex *dst)
{
  // Flat element index for a 2D grid of 1D blocks.
  int index = threadIdx.x + blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*gridDim.x;

  dst[index] = cuCaddf(src[index],dst[index]);
}

Now when I look at the Visual Profiler output, I see that all my reads and writes are coalesced (as expected), but I’ve got n_writes = 4 * n_reads. I would expect n_reads = 2 * n_writes, since each thread reads from two arrays and writes back to one.
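For anyone who wants to reproduce the counts, here is a minimal host-side sketch of how such a kernel might be launched on a 2D grid. The launch shape (256 threads per block, 4 blocks in x), the helper name run_add_check, and the test values are assumptions for illustration and not taken from the original code. It uses cuComplex from cuComplex.h, of which cufftComplex is a typedef, so it builds without the cuFFT headers. Per thread there are two 8-byte loads (src and dst) and one 8-byte store (dst), which is where the expected n_reads = 2 * n_writes comes from.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Same body as the kernel in the post; cufftComplex is a typedef of cuComplex.
__global__ void add(const cuComplex *src, cuComplex *dst)
{
    // Flat element index for a 2D grid of 1D blocks.
    int index = threadIdx.x + blockIdx.x * blockDim.x
              + blockIdx.y * blockDim.x * gridDim.x;
    dst[index] = cuCaddf(src[index], dst[index]);
}

// Hypothetical helper: runs the kernel on n elements and verifies the sums.
// The kernel has no bounds check, so n must be a positive multiple of
// threads * gridX.
bool run_add_check(int n)
{
    const int threads = 256, gridX = 4;   // assumed launch shape
    if (n <= 0 || n % (threads * gridX) != 0) return false;
    dim3 grid(gridX, n / (threads * gridX));

    std::vector<cuComplex> hs(n), hd(n);
    for (int i = 0; i < n; ++i) {
        hs[i] = make_cuComplex((float)i, 1.0f);
        hd[i] = make_cuComplex(2.0f, -(float)i);
    }

    size_t bytes = n * sizeof(cuComplex);
    cuComplex *ds = 0, *dd = 0;
    if (cudaMalloc((void **)&ds, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dd, bytes) != cudaSuccess) { cudaFree(ds); return false; }
    cudaMemcpy(ds, hs.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dd, hd.data(), bytes, cudaMemcpyHostToDevice);

    add<<<grid, threads>>>(ds, dd);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hd.data(), dd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(ds);
    cudaFree(dd);
    if (err != cudaSuccess) return false;

    // Each element should now hold src + old dst = (i + 2, 1 - i).
    for (int i = 0; i < n; ++i)
        if (hd[i].x != (float)i + 2.0f || hd[i].y != 1.0f - (float)i) return false;
    return true;
}

int main()
{
    printf("%s\n", run_add_check(4096) ? "ok" : "mismatch");
    return 0;
}
```

Running this under the profiler with a few different values of n should show whether the read/write ratio depends on the problem size or stays fixed.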

Any explanation?



I’ve seen a similar thing with a slightly more complicated kernel. IIRC the number of reads was twice as many as expected, while the number of writes was correct.

What hardware are you using?

I’m working on a Tesla C870.

Even changing the calculation to float-wise or float4-wise accesses (since it doesn’t matter how much is calculated in one thread) doesn’t change these numbers.
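The exact variants were not posted, so the following is only a sketch of what float-wise and float4-wise versions could look like, treating the interleaved complex data as plain floats. The kernel names add_float and add_float4, the count parameter with its bounds check, and the helper run_variants_check are all assumptions added for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One float per thread: n complex values viewed as 2*n interleaved floats.
__global__ void add_float(const float *src, float *dst, int count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < count) dst[i] += src[i];
}

// Four floats (two complex values) per thread, one 16-byte access each.
__global__ void add_float4(const float4 *src, float4 *dst, int count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < count) {
        float4 s = src[i], d = dst[i];
        dst[i] = make_float4(s.x + d.x, s.y + d.y, s.z + d.z, s.w + d.w);
    }
}

// Hypothetical helper: applies both variants in turn and checks the result.
// nfloats must be a positive multiple of 4 so the float4 view covers the array.
bool run_variants_check(int nfloats)
{
    if (nfloats <= 0 || nfloats % 4 != 0) return false;
    const int threads = 256;              // assumed block size

    std::vector<float> hs(nfloats), hd(nfloats, 1.0f);
    for (int i = 0; i < nfloats; ++i) hs[i] = (float)i;

    size_t bytes = nfloats * sizeof(float);
    float *ds = 0, *dd = 0;
    if (cudaMalloc((void **)&ds, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dd, bytes) != cudaSuccess) { cudaFree(ds); return false; }
    cudaMemcpy(ds, hs.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dd, hd.data(), bytes, cudaMemcpyHostToDevice);

    // First pass: dst = 1 + i
    add_float<<<(nfloats + threads - 1) / threads, threads>>>(ds, dd, nfloats);

    // Second pass over the same buffers viewed as float4: dst = 1 + 2*i
    int n4 = nfloats / 4;
    add_float4<<<(n4 + threads - 1) / threads, threads>>>(
        (const float4 *)ds, (float4 *)dd, n4);

    // This blocking copy also waits for both kernels to finish.
    cudaError_t err = cudaMemcpy(hd.data(), dd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(ds);
    cudaFree(dd);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < nfloats; ++i)
        if (hd[i] != 1.0f + 2.0f * (float)i) return false;
    return true;
}

int main()
{
    printf("%s\n", run_variants_check(8192) ? "ok" : "mismatch");
    return 0;
}
```

Both variants move the same total number of bytes as the cufftComplex kernel; only the access width per thread differs (4, 8, or 16 bytes), which makes them a reasonable way to check whether the counters track transactions or bytes.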