CUDA profiler: global memory counters are 2x the expected number of operations

I have a very simple kernel:

__global__ void testKernel(float* array)
{
    array[threadIdx.x] = array[threadIdx.x] + 1.0f;
}

When I run this kernel with 1 thread in 1 block for 1 iteration, I get:
gld_request = 2
gld_inst_32bit = 2
gst_request = 2
gst_inst_32bit = 2

When I run this kernel with 32 and 64 threads (the size of 1 and 2 warps on my card), I get, respectively:
gld_request = 2 and 4
gld_inst_32bit = 2 and 4
gst_request = 64 and 128
gst_inst_32bit = 64 and 128

These values all appear to be double what I’d expect. The same pattern continues for higher numbers of threads and thread blocks. Does anyone know why these values appear doubled?

These tests were done on a GTX 560 Ti, with CUDA 4.2.
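
For anyone wanting to reproduce this, here is a minimal host-side sketch that launches the kernel in the three configurations described above (1, 32 and 64 threads in a single block). The helper name runTest, the zero-initialised input and the result check are assumptions for illustration, not the code actually used for the measurements. It sticks to pre-C++11 syntax so that it should also build with an older toolkit such as CUDA 4.2.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void testKernel(float* array)
{
    array[threadIdx.x] = array[threadIdx.x] + 1.0f;
}

// Hypothetical helper (not from the original post): launches one block of
// `threads` threads on a zero-filled buffer and returns true if every
// element was incremented exactly once.
bool runTest(int threads)
{
    std::vector<float> host(threads, 0.0f);
    size_t bytes = threads * sizeof(float);

    float* dev = NULL;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess)
        return false;
    cudaMemcpy(dev, &host[0], bytes, cudaMemcpyHostToDevice);

    // 1 block, `threads` threads, a single iteration of the kernel.
    testKernel<<<1, threads>>>(dev);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < threads; ++i)
        if (host[i] != 1.0f)
            return false;
    return true;
}

int main()
{
    // The three configurations from the post: 1 thread, 1 warp, 2 warps.
    const int configs[3] = { 1, 32, 64 };
    for (int i = 0; i < 3; ++i)
        printf("%d threads: %s\n", configs[i],
               runTest(configs[i]) ? "ok" : "FAILED");
    return 0;
}
```

Profiling one configuration per run keeps the counter values for each launch separate, which makes it easier to tell a profiler quirk from a mistake in whatever post-processes the output.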

Thanks.

I know why! There’s a bug in my code that collates the profiler output :|

So the profiler counters themselves are fine and should show the numbers you’d expect.