The number of coalesced stores reported by the profiler (1.0 or 1.1) doesn’t seem to match my understanding.
For a very simple case:
I have declared two global memory pointers of type float*: float* data_1 and float* data_2.
I allocated memory for them as follows:
long d_size = sizeof(float) * 8192;
size_t pitch = 128; // note: pitch is an output parameter, so cudaMallocPitch overwrites this value
CUDA_SAFE_CALL( cudaMallocPitch( (void**) &data_1, &pitch, d_size, 1));
CUDA_SAFE_CALL( cudaMallocPitch( (void**) &data_2, &pitch, d_size, 1));
The grid and block dimensions are set as follows:
dim3 dimBlock(1, 1, 1);
dim3 dimGrid(1, 1, 1);
And the kernel body is:
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
data_2[x] = data_1[x] * 2; // each thread reads one float and writes one float
The profiler returns: gld coalesced = 1 and gst coalesced = 2.
If I change to 32 threads in one block: dim3 dimBlock(32, 1, 1);
The profiler returns: gld coalesced = 2 and gst coalesced = 8.
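For anyone who wants to reproduce this, here is a minimal self-contained sketch of the setup described above. It is not my exact source: the kernel name scale_by_two and the CHECK macro are hypothetical stand-ins (CHECK replaces the SDK's CUDA_SAFE_CALL), the input is simply zero-filled, and cudaDeviceSynchronize assumes a more recent toolkit than CUDA 2.0.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the SDK's CUDA_SAFE_CALL macro.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err), __LINE__);               \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel name; each thread doubles one element.
__global__ void scale_by_two(const float* data_1, float* data_2)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    data_2[x] = data_1[x] * 2;
}

int main()
{
    const size_t d_size = sizeof(float) * 8192;
    float* data_1 = NULL;
    float* data_2 = NULL;
    size_t pitch = 0;  // output parameter, written by cudaMallocPitch

    // One row of d_size bytes per buffer, as in the post.
    CHECK(cudaMallocPitch((void**)&data_1, &pitch, d_size, 1));
    CHECK(cudaMallocPitch((void**)&data_2, &pitch, d_size, 1));
    CHECK(cudaMemset(data_1, 0, d_size));  // zero-filled input (assumption)

    dim3 dimBlock(32, 1, 1);  // use 1 for the first experiment, 32 for the second
    dim3 dimGrid(1, 1, 1);
    scale_by_two<<<dimGrid, dimBlock>>>(data_1, data_2);
    CHECK(cudaGetLastError());       // catches launch configuration errors
    CHECK(cudaDeviceSynchronize());  // waits for the kernel to finish

    CHECK(cudaFree(data_1));
    CHECK(cudaFree(data_2));
    return 0;
}
```

Running this under the profiler once with dimBlock(1, 1, 1) and once with dimBlock(32, 1, 1) should exercise the two configurations I am comparing.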
What am I missing? Why isn't the number of loads equal to the number of stores?
Even if I read from data_2 and write to data_1 (to check whether global memory alignment is the problem), I still get the same load and store counts as reported above.
Any insights?
I am using a GTX 280 card (compute capability 1.3) on a Linux (Fedora 3) machine with CUDA 2.0. Both profilers (1.0 and 1.1) return the same values.
Thanks -
kpg