CUDA Profiler 1.1 - question on gst coalesced value

The number of coalesced stores reported by the profiler (1.0 or 1.1) doesn’t seem to match my understanding.

For a very simple case:
I have declared two global memory pointers of type float*: float* data_1 and float* data_2.

I allocated memory for them as follows:

long d_size = sizeof(float) * 8192;
size_t pitch = 128;
CUDA_SAFE_CALL( cudaMallocPitch( (void**) &data_1, &pitch, d_size, 1));
CUDA_SAFE_CALL( cudaMallocPitch( (void**) &data_2, &pitch, d_size, 1));

dimGrid and dimBlock are set as follows:
dim3 dimBlock(1, 1, 1);
dim3 dimGrid(1, 1, 1);

And the instructions in the kernel code are:
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
data_2[x] = data_1[x] * 2;

The profiler returns: gld coalesced = 1 and gst coalesced = 2

If I change to 32 threads in one block: dim3 dimBlock(32, 1, 1);
The profiler returns: gld coalesced = 2 and gst coalesced = 8

What am I missing? Why isn’t the number of loads equal to the number of stores?
Even if I read from data_2 and write to data_1 (to check whether global memory alignment is the problem), I still get the same numbers of loads and stores reported above.

Any insights?

I am using a GTX 280 card (compute capability 1.3) on a Linux (Fedora 3) machine with CUDA 2.0. Both profilers, 1.0 and 1.1, return the same values.
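
In case it helps anyone reproduce this, here is a minimal sketch that assembles the fragments above into one file. The kernel name scale_by_two and the CHECK macro are placeholders of mine (CHECK stands in for cutil's CUDA_SAFE_CALL), and the sketch assumes the kernel is meant to index both arrays with x. It is only a test harness, not a claim about how the counters are computed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder error check, standing in for cutil's CUDA_SAFE_CALL.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Placeholder kernel name; each thread does one 4-byte load and one 4-byte store.
__global__ void scale_by_two(const float* data_1, float* data_2)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    data_2[x] = data_1[x] * 2;
}

int main()
{
    const size_t d_size = sizeof(float) * 8192;  // row width in bytes
    float* data_1 = NULL;
    float* data_2 = NULL;
    // cudaMallocPitch writes the pitch it chose into these variables,
    // so any value assigned beforehand is overwritten.
    size_t pitch_1 = 0, pitch_2 = 0;

    CHECK(cudaMallocPitch((void**)&data_1, &pitch_1, d_size, 1));
    CHECK(cudaMallocPitch((void**)&data_2, &pitch_2, d_size, 1));
    CHECK(cudaMemset(data_1, 0, d_size));  // give the loads defined input

    dim3 dimBlock(1, 1, 1);  // use dimBlock(32, 1, 1) for the second run
    dim3 dimGrid(1, 1, 1);
    scale_by_two<<<dimGrid, dimBlock>>>(data_1, data_2);
    CHECK(cudaGetLastError());
    // On CUDA 2.x-era toolkits the equivalent call is cudaThreadSynchronize().
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(data_1));
    CHECK(cudaFree(data_2));
    return 0;
}
```

Running this once with each dimBlock setting under the profiler should give the two pairs of counter values to compare.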

Thanks -
kpg

I think I remember what those counters mean, but you don’t want to use them anyway because they are pretty useless on GT200. You should grab the 2.2 beta from the registered developer site and use the new counters, which measure individual memory transaction sizes.