uchar4 optimization bug when writing to global memory

PINS · March 24, 2008, 1:42am

Hello, long time no post…

I’ve been away for a while, so I don’t know whether this is a known issue. I’ve found a small optimization bug when using uchar4 to write to global memory, as in an RGBA framebuffer. It only happens when there is a conversion from float to uchar.

It’s really simple: whenever I mix local variables with constant values, the compiler does not emit a single st.global.v4.u8 but several smaller stores instead. As a result, the writes are not coalesced and performance drops to half or less.

Code that is optimized:

globalMemPtr[pixelAddress] = make_uchar4( 255, 255, 255, 255 );

or

globalMemPtr[pixelAddress] = make_uchar4( r, g, b, a );

Code that fails to be optimized:

float r = 255.0f;

globalMemPtr[pixelAddress] = make_uchar4( r, 255, 255, 255 );

This last one generates something like this in .ptx:

mov.f32  $f1, 0f437f0000;      //  255

cvt.rzi.u32.f32  $r13, $f1;    //  

st.global.u8  [$r12+0], $r13;  //  id:47

mov.u16  $rh5, 255;            //  

st.global.u8  [$r12+1], $rh5;  //  id:48

mov.u16  $rh6, 255;            //  

mov.u16  $rh7, 255;            //  

st.global.v2.u8  [$r12+2], {$rh6,$rh7};	//
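
For anyone who wants to reproduce the comparison, a minimal sketch along these lines may be enough. The kernel names, the bounds check on n, the host helper check_fill_mixed, and passing r as a kernel parameter are all assumptions of this sketch and not part of the original code; r is a parameter only so that the float conversion cannot be folded into a constant at compile time. Compiling with nvcc -ptx and searching the output for st.global should show which store widths were chosen for each kernel, and the result will likely differ between toolkit versions.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant A: every component is an integer constant.
__global__ void fill_constants(uchar4* globalMemPtr, int n)
{
    int pixelAddress = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixelAddress < n)
        globalMemPtr[pixelAddress] = make_uchar4(255, 255, 255, 255);
}

// Variant B: one component comes from a float, the rest are constants.
// r is a kernel parameter here (an assumption of this sketch) so that the
// float-to-integer conversion has to happen at run time.
__global__ void fill_mixed(uchar4* globalMemPtr, int n, float r)
{
    int pixelAddress = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixelAddress < n)
        globalMemPtr[pixelAddress] = make_uchar4(r, 255, 255, 255);
}

// Hypothetical host helper: runs variant B and checks that every pixel
// reads back as (255, 255, 255, 255). This verifies correctness only;
// the store width has to be read from the generated PTX.
bool check_fill_mixed(int n = 1024)
{
    uchar4* d_pixels = nullptr;
    if (cudaMalloc(&d_pixels, n * sizeof(uchar4)) != cudaSuccess)
        return false;

    fill_mixed<<<(n + 255) / 256, 256>>>(d_pixels, n, 255.0f);

    std::vector<uchar4> h_pixels(n);
    cudaError_t err = cudaMemcpy(h_pixels.data(), d_pixels,
                                 n * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaFree(d_pixels);
    if (err != cudaSuccess)
        return false;

    for (const uchar4& p : h_pixels)
        if (p.x != 255 || p.y != 255 || p.z != 255 || p.w != 255)
            return false;
    return true;
}
```

Both kernels write the same bytes, so any difference in the emitted stores comes from how the components were produced, not from what is stored.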

Thanks for listening!

kristleifur · March 24, 2008, 11:50am

Interesting stuff, thanks for posting. I’m curious, what does the compiler happen to do when it sees …

float r = 255.0f;

globalMemPtr[pixelAddress] = make_uchar4( (int) r, 255, 255, 255 );

… this? (The difference is the (int) r cast.)

m11 · March 24, 2008, 4:34pm

Very interesting!

I tried that and nothing helped. I have no better idea than to wait for a service pack or mess with assembler…

I tried unsigned char u=floor(r),…
and (int) r and (unsigned char) r … the result is the same (small writes).

This did not help either:
//test
float r=255.0f;
unsigned char u;
uchar4 u4;
u=floor(r);
u4=make_uchar4( u, 255, 255, 255 );
int pixelAddress;
pixelAddress=75;
globalMemPtr[pixelAddress] = u4;
//test-end

:'(

PINS · March 24, 2008, 9:00pm

Well, the workaround is simple: force yourself to use all uchars or all floats, as the compiler likes them better :P

I’m using these for the moment:

static inline __device__ uchar4 make_color( unsigned char r, unsigned char g, unsigned char b )

{

    return make_uchar4( r, g, b, 255 );

}

static inline __device__ uchar4 make_color( unsigned char value )

{

    return make_uchar4( value, value, value, 255 );

}

static inline __device__ uchar4 make_color( float3 color, float alpha )

{

    return make_uchar4( color.x * 255.0f, color.y * 255.0f, color.z * 255.0f, alpha * 255.0f );

}

Didn’t have time to test with other packed types, maybe they’re ok.
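
As a rough illustration of how the float overload above might be used, here is a sketch of a kernel that fills a framebuffer with a horizontal gray gradient. The kernel shade, the host helper render_gradient, and the block and grid sizes are hypothetical names and choices made for this example; only make_color comes from the post. All four components go through the same float path, which is the pattern described above as a workaround, though whether a given compiler version then emits a single vector store would still need to be checked in the PTX.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Helper from the post above: all four components are floats that are
// converted to unsigned char in the same way.
static inline __device__ uchar4 make_color(float3 color, float alpha)
{
    return make_uchar4(color.x * 255.0f, color.y * 255.0f,
                       color.z * 255.0f, alpha * 255.0f);
}

// Hypothetical kernel: black on the left edge, white on the right edge.
__global__ void shade(uchar4* framebuffer, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    // Avoid dividing by zero for a one-pixel-wide image.
    float t = (width > 1) ? (float)x / (float)(width - 1) : 0.0f;
    framebuffer[y * width + x] = make_color(make_float3(t, t, t), 1.0f);
}

// Hypothetical host helper: renders the gradient and copies it back.
// Returns an empty vector if any CUDA call fails.
std::vector<uchar4> render_gradient(int width, int height)
{
    size_t count = (size_t)width * (size_t)height;
    uchar4* d_fb = nullptr;
    if (cudaMalloc(&d_fb, count * sizeof(uchar4)) != cudaSuccess)
        return {};

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    shade<<<grid, block>>>(d_fb, width, height);

    std::vector<uchar4> pixels(count);
    cudaError_t err = cudaMemcpy(pixels.data(), d_fb,
                                 count * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaFree(d_fb);
    if (err != cudaSuccess)
        return {};
    return pixels;
}
```

Calling render_gradient(64, 4) would return 256 pixels, with the alpha channel set to 255 throughout.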
