Hello, long time no post…
Since I’ve been away for a while, I don’t know if this is a known issue or not. I’ve found a small optimization bug when using uchar4 to write to global memory, as in an RGBA framebuffer. It only happens when there is a conversion from float to uchar.
It’s really simple: whenever I mix local float variables with constant values in the same make_uchar4 call, the compiler does not emit a single st.global.v4.u8 but splits the write into several smaller stores. As a result, there is no coalescing and performance drops by at least half.
Code that is optimized:
globalMemPtr[pixelAddress] = make_uchar4( 255, 255, 255, 255 );
or
globalMemPtr[pixelAddress] = make_uchar4( r, g, b, a );
Code that fails to be optimized:
float r = 255.0f;
globalMemPtr[pixelAddress] = make_uchar4( r, 255, 255, 255 );
This last one generates something like this in .ptx:
mov.f32 $f1, 0f437f0000; // 255
cvt.rzi.u32.f32 $r13, $f1; //
st.global.u8 [$r12+0], $r13; // id:47
mov.u16 $rh5, 255; //
st.global.u8 [$r12+1], $rh5; // id:48
mov.u16 $rh6, 255; //
mov.u16 $rh7, 255; //
st.global.v2.u8 [$r12+2], {$rh6,$rh7}; //
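One possible workaround, untested against this compiler version and only a sketch: do the float to uchar conversion into a local variable first, build the whole uchar4 in a register, and then write it with a single assignment. The kernel name fillPixels, the numPixels parameter and the host code are made up for illustration; only uchar4, make_uchar4, globalMemPtr and pixelAddress come from the post above. Whether this actually brings back st.global.v4.u8 has to be checked in the generated PTX.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fillPixels and numPixels are illustrative names.
__global__ void fillPixels(uchar4* globalMemPtr, int numPixels, float r)
{
    int pixelAddress = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixelAddress >= numPixels) return;

    // Convert float -> uchar explicitly, before building the vector,
    // so make_uchar4 only sees 8-bit operands.
    unsigned char r8 = static_cast<unsigned char>(r);

    // Build the full pixel in a local variable, then store it in one assignment.
    uchar4 pixel = make_uchar4(r8, 255, 255, 255);
    globalMemPtr[pixelAddress] = pixel;
}

int main()
{
    const int numPixels = 256;
    uchar4* devPixels = nullptr;
    if (cudaMalloc(&devPixels, numPixels * sizeof(uchar4)) != cudaSuccess) return 1;

    // 128 threads per block is an arbitrary choice for this sketch.
    fillPixels<<<(numPixels + 127) / 128, 128>>>(devPixels, numPixels, 255.0f);

    // Copy back the first pixel to confirm the kernel wrote something sensible.
    uchar4 first;
    cudaError_t err = cudaMemcpy(&first, devPixels, sizeof(uchar4),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devPixels);
    if (err != cudaSuccess) return 1;

    printf("%d %d %d %d\n", first.x, first.y, first.z, first.w);
    return 0;
}
```

Compiling with nvcc -ptx and searching the output for st.global.v4.u8 should show whether the compiler merged the stores for this variant.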
Thanks for listening!