Uncoalesced reads, coalesced writes: same access pattern, different coalesced I/O outcome?

I’m a moderately experienced CUDA programmer, and I’ve been optimizing code for multiple compute capabilities in a single program for a while. I understand the rules for coalesced reads and writes, especially the more stringent ones for compute capability 1.1 and below. That said, I don’t understand why the code I’ve profiled below shows almost all of its reads as uncoalesced. It’s launched with 16 x 16 thread blocks, and as many blocks as needed to cover the 1920 x 1080 monochrome HD image I’m processing (120 x 68). Here’s the relevant line of the profiler output, showing 129,600 coalesced loads, 4.15 million uncoalesced loads, 0 uncoalesced writes, and 518,400 coalesced writes.

186076 Normalize 32292.8 32319.5 1 8160 4.1472e+06 129600 0 518400

I looked at the address of the src pointer and it’s a multiple of 64 bytes, so it should be aligned (especially since I cudaMallocPitch’d it). The float3 stats structure holds the min, max, and sum in its .x, .y, and .z members. This function normalizes the src array to a 0-to-1 range (another kernel already computed the min and max), so all it does is read the data, scale it, and write it back out. Pretty simple, and, more importantly, it indexes 4-byte elements with a column index that follows threadIdx.x. This function should be perfectly coalesced (except perhaps for the partial blocks at the right edge). Any ideas on why the loads don’t coalesce?

__global__ void Normalize( float *dest, size_t dest_pitch, float *src, size_t src_pitch, const uint2 imgsize, const float3 *stats )
{
    unsigned int i = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int j = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Are we inside the valid image area?
    if ( j < imgsize.x && i < imgsize.y ) {
        float *rowptr = (float *)((char *)src + i * src_pitch);
        float unscaled = rowptr[j];
        rowptr = (float *)((char *)dest + i * dest_pitch);
        rowptr[j] = (unscaled - stats[0].x) / (stats[0].y - stats[0].x);
    } // End if inside image area
    __syncthreads();
} // End function Normalize

Is src_pitch 1920 (or some other multiple of 128)? The src pointer should be aligned to a multiple of 128 bytes as well, not just 64.

The pitch of src is 7680 (1920 * 4 byte float), and an address assigned on one particular run was 43778048, which is 128 byte aligned (actually, it’s better than 256 byte aligned), so the alignment doesn’t appear to be the problem.
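For reference, the allocation looks roughly like this (a sketch; the variable names are placeholders, not my actual code):

    float *src;
    size_t src_pitch;
    // cudaMallocPitch pads each row so every row start stays suitably aligned.
    // A 1920-float row is already a multiple of 128 bytes, so no padding is
    // added and the returned pitch comes back as 7680.
    cudaMallocPitch( (void **)&src, &src_pitch, 1920 * sizeof(float), 1080 );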

Actually, while writing this, I’ve just realized what the problem is. It has nothing to do with src, and everything to do with the last parameter (float3 *stats). That is a single float3 element, passed by pointer rather than by value, so every time it’s referenced there’s a global memory fetch. Since every thread in a thread block needs this one (and only) value, it looks like I’ve got two options:
(1) Use the first thread of every block to load the value into a shared memory variable, sync, then proceed.
(2) Copy the contents of stats to constant memory before the kernel call and hope the constant cache absorbs the repeated fetches.

I might actually try both methods just to see which one works better.
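Roughly what option (1) would look like, built on the kernel above (an untested sketch):

    __global__ void Normalize( float *dest, size_t dest_pitch, float *src, size_t src_pitch, const uint2 imgsize, const float3 *stats )
    {
        // One thread per block reads stats from global memory; every other
        // thread picks it up from shared memory after the barrier.
        __shared__ float3 s_stats;
        if ( threadIdx.x == 0 && threadIdx.y == 0 )
            s_stats = stats[0];
        __syncthreads();

        unsigned int i = (blockIdx.y * blockDim.y) + threadIdx.y;
        unsigned int j = (blockIdx.x * blockDim.x) + threadIdx.x;
        if ( j < imgsize.x && i < imgsize.y ) {
            float *rowptr = (float *)((char *)src + i * src_pitch);
            float unscaled = rowptr[j];
            rowptr = (float *)((char *)dest + i * dest_pitch);
            rowptr[j] = (unscaled - s_stats.x) / (s_stats.y - s_stats.x);
        }
    }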

True about the stats global memory access causing the problem. Sounds like a perfect candidate for constant memory.
You might even precalculate "scale = 1.0f/(stats[0].y - stats[0].x)" and "bias = -stats[0].x/(stats[0].y - stats[0].x)" so the per-pixel operation becomes a single MAD: "result = scale*input + bias".
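In code, that suggestion looks roughly like this (symbol names are placeholders, and it assumes the min/max have already been copied back to the host as h_stats):

    __constant__ float c_scale;   // 1.0f / (max - min)
    __constant__ float c_bias;    // -min / (max - min)

    void SetupNormalize( const float3 &h_stats )
    {
        float scale = 1.0f / (h_stats.y - h_stats.x);
        float bias  = -h_stats.x / (h_stats.y - h_stats.x);
        cudaMemcpyToSymbol( c_scale, &scale, sizeof(float) );
        cudaMemcpyToSymbol( c_bias,  &bias,  sizeof(float) );
    }

    // In the kernel body the per-pixel work then collapses to one MAD:
    //   rowptr[j] = c_scale * unscaled + c_bias;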

I tried constant memory, but that was actually quite slow as well; the profiler still reported the same uncoalesced loads as with global memory. I declared the variable with the __constant__ (and __device__) qualifiers, and it didn’t perform as hoped.

I tried the single load per block (only the first thread reads into shared memory), and that worked pretty well. But the best (fastest) solution was actually to bring the result back out to the host via a cudaMemcpy(…DeviceToHost) into a float3 variable, then call the kernel with a value rather than a pointer (7802 clocks vs. 5342 from the profiler). This seems to be because of the divergent branches handling the special case of the first thread, as well as each block’s single load still being uncoalesced (even though there are only 8160 loads instead of 8160 * 256). I was quite stunned by this result; I figured the two-way trip across the PCI-E bus would be slower than simply accessing device memory. This is running on a slower, compute capability 1.1 card, but I’ll try it on a Tesla C1060 to see whether the same holds there.

Precalculated the scale and bias as you recommended, and indeed it’s faster, but only by about 1%. Not terribly surprising, since that operation was being done in parallel anyway. I actually thought the GPU might do it faster because I was using the reduced-precision __fdividef intrinsic, but that was not the case.
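For completeness, the fastest variant looks roughly like this (a sketch; the host-side names are placeholders):

    __global__ void Normalize( float *dest, size_t dest_pitch, float *src, size_t src_pitch,
                               const uint2 imgsize, const float scale, const float bias )
    {
        unsigned int i = (blockIdx.y * blockDim.y) + threadIdx.y;
        unsigned int j = (blockIdx.x * blockDim.x) + threadIdx.x;
        if ( j < imgsize.x && i < imgsize.y ) {
            float *rowptr = (float *)((char *)src + i * src_pitch);
            float unscaled = rowptr[j];
            rowptr = (float *)((char *)dest + i * dest_pitch);
            rowptr[j] = scale * unscaled + bias;   // single MAD, no global fetch for stats
        }
    }

    // Host side: one small DeviceToHost copy, precompute scale/bias, pass by value.
    float3 h_stats;
    cudaMemcpy( &h_stats, d_stats, sizeof(float3), cudaMemcpyDeviceToHost );
    float scale = 1.0f / (h_stats.y - h_stats.x);
    float bias  = -h_stats.x / (h_stats.y - h_stats.x);
    Normalize<<< grid, block >>>( d_dest, dest_pitch, d_src, src_pitch, imgsize, scale, bias );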

Thanks for your time and effort Nighthawk, it’s greatly appreciated.