I have a very simple kernel:
__global__ void testKernel(float* array)
{
    // Each thread loads one float, adds 1, and stores it back.
    array[threadIdx.x] = array[threadIdx.x] + 1.0f;
}
When I run this kernel with 1 thread in 1 block for 1 iteration, I get:
gld_request = 2
gld_inst_32bit = 2
gst_request = 2
gst_inst_32bit = 2
When I run this kernel with 32 and 64 threads (1 and 2 warps on my card), I get, respectively:
gld_request = 2 and 4
gld_inst_32bit = 2 and 4
gst_request = 64 and 128
gst_inst_32bit = 64 and 128
These values all appear to be double what I’d expect. The same pattern continues for higher numbers of threads and thread blocks. Does anyone know why these values appear doubled?
These tests were done on a GTX 560 Ti, with CUDA 4.2.
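For anyone who wants to reproduce this, here is a minimal sketch of the kind of host code that launches the kernel. It is not the exact test program: the names numThreads, host, and dev are placeholders for illustration, and the thread count is assumed to be changed by hand between runs (1, 32, or 64). The profiler counters themselves are collected separately and are not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void testKernel(float* array)
{
    // Each thread loads one float, adds 1, and stores it back.
    array[threadIdx.x] = array[threadIdx.x] + 1.0f;
}

int main()
{
    // Assumption: set to 1, 32, or 64 to match the runs described above.
    const int numThreads = 32;
    const size_t bytes = numThreads * sizeof(float);

    // Hypothetical host buffer, sized for the largest run (64 threads).
    float host[64] = {0.0f};
    float* dev = NULL;

    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    // One block, one launch ("1 iteration").
    testKernel<<<1, numThreads>>>(dev);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("array[0] = %f\n", host[0]);
    return 0;
}
```

With a single launch like this, the source line contains exactly one global load and one global store per thread, which is the baseline I am comparing the counters against.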