Weird coalescing results in cuda 3.0 profiler

I’m trying to write up a simple program to read pixels from an image and I decided to use a char4 type to keep with the 4byte alignment criteria for memory coalescing.

Image i size is 640x480, 1 byte per pixel ( grayscale).

Allocated memory using cudaMallocPitch returns a pitch value of 768.

In the kernel, I simply copy from a char4 source to a char4 destination. Both are allocated to the same size.

[codebox]int offset = threadIdx.x + blockIdx.y * pitch;

output[ offset ] = input[ offset ];[/codebox]

BlockDim = 160 x 1 ( basically 1 block reads 1 row of the image )

GridDim = 1 x 480

While theoretically this should allow for coalesced reads within each thread block (which is what cudaMallocPitch is supposed to guarantee), the profiler actually reports this isn’t the case! Note: For some weird reason my profiler reports the number of uncoalsced reads, but not the number of coalesced reads ( which is always zero ).

If I set gridDim to 1 x 3, the profiler reports no uncoalsced reads/writes

If I set gridDim to 1 x 4, profiler says there’s 640 uncoalsced reads and 1280 uncoalsced writes. Another strange thing is how can there be more writes than reads when it’s doing a 1:1 copy?

PS. cuda device is sm1.1