I’m trying to write up a simple program to read pixels from an image and I decided to use a char4 type to keep with the 4byte alignment criteria for memory coalescing.
Image i size is 640x480, 1 byte per pixel ( grayscale).
Allocated memory using cudaMallocPitch returns a pitch value of 768.
In the kernel, I simply copy from a char4 source to a char4 destination. Both are allocated to the same size.
[codebox]int offset = threadIdx.x + blockIdx.y * pitch;
output[ offset ] = input[ offset ];[/codebox]
BlockDim = 160 x 1 ( basically 1 block reads 1 row of the image )
GridDim = 1 x 480
While theoretically this should allow for coalesced reads within each thread block (which is what cudaMallocPitch is supposed to guarantee), the profiler actually reports this isn’t the case! Note: For some weird reason my profiler reports the number of uncoalsced reads, but not the number of coalesced reads ( which is always zero ).
If I set gridDim to 1 x 3, the profiler reports no uncoalsced reads/writes
If I set gridDim to 1 x 4, profiler says there’s 640 uncoalsced reads and 1280 uncoalsced writes. Another strange thing is how can there be more writes than reads when it’s doing a 1:1 copy?
PS. cuda device is sm1.1