uchar4 optimization bug when writing to global memory

Hello, long time no post…

Since I’ve been away for a while, don’t know if this is a known issue or not. I’ve found a small optimization bug when using uchar4 to write to global memory, as in a RGBA framebuffer. It only happens when there is a conversion from float to uchar.

It’s really simple: whenever I intermix local variables with constant values, the compiler chooses to not use st.global.v4.u8 but several smaller stores. As a result, there is no coalescing and performance drops to at least half.

Code that is optimized:

globalMemPtr[pixelAddress] = make_uchar4( 255, 255, 255, 255 );


globalMemPtr[pixelAddress] = make_uchar4( r, g, b, a );

Code that fails to be optimized:

float r = 255.0f;

globalMemPtr[pixelAddress] = make_uchar4( r, 255, 255, 255 );

This last one generates something like this in .ptx:

mov.f32  $f1, 0f437f0000;      //  255

cvt.rzi.u32.f32  $r13, $f1;    //  

st.global.u8  [$r12+0], $r13;  //  id:47

mov.u16  $rh5, 255;            //  

st.global.u8  [$r12+1], $rh5;  //  id:48

mov.u16  $rh6, 255;            //  

mov.u16  $rh7, 255;            //  

st.global.v2.u8  [$r12+2], {$rh6,$rh7};	//  

Thanks for listening!

Interesting stuff, thanks for posting. I’m curious, what does the compiler happen to do when it sees …

float r = 255.0f;

globalMemPtr[pixelAddress] = make_uchar4( (int) r, 255, 255, 255 );

… this? (The difference is the (int) r cast.)

very interesting !

I tried that, nothing helped, no idea what to do better than wait for sevice pack or mess with assembler…

tried unsigned char u=floor®,…
and (int) r and (unsigned char) r … result is the same (small writes)

neither this helped
float r=255.0f;
unsigned char u;
uchar4 u4;
u4=make_uchar4( u, 255, 255, 255 );
int pixelAddress;
globalMemPtr[pixelAddress] = u4;

<img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

Well, the workaround is simple: force yourself to use all uchars or all floats, as the compiler likes them better :P

I’m using these for the moment:

static inline __device__ uchar4 make_color( unsigned char r, unsigned char g, unsigned char b )


    return make_uchar4( r, g, b, 255 );


static inline __device__ uchar4 make_color( unsigned char value )


    return make_uchar4( value, value, value, 255 );


static inline __device__ uchar4 make_color( float3 color, float alpha )


    return make_uchar4( color.x * 255.0f, color.y * 255.0f, color.z * 255.0f, alpha * 255.0f );


Didn’t have time to test with other packed types, maybe they’re ok.