Hi,
I want to speed up the copy from global memory to shared memory by packing two floats into one int.
My first approach:
__global__ void test( int* globalInt )
{
[..]
// Unpack
int intVal = globalInt[ index ];
float floatVal1 = __int_as_float( intVal >> 16 );
float floatVal2 = __int_as_float( intVal & 0xffff );
//Do some calculations
// Pack
intVal = __float_as_int( floatVal1 ) << 16;
intVal += __float_as_int( floatVal2 ) & 0xffff;
}
But this doesn’t work. The float values are between -32768 and 32768.
Can somebody give me a hint?
Kind regards
Jens
__int_as_float is a bitwise reinterpretation: it doesn’t convert the integer value to a float, it treats the raw integer bits as if they were a float (__float_as_int does the reverse).
What you want is an ordinary value conversion (a cast), which can even be implicit:
float floatVal1 = intVal >> 16;
float floatVal2 = intVal & 0xffff;
//Do some calculations
// Pack
intVal = ((int)floatVal1) << 16;
intVal += ((int)floatVal2) & 0xffff;
Note that these conversions only work if your values really are in range, i.e. whole numbers that fit in 16 bits; any fractional part is lost in the cast to int.
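To make the difference between reinterpreting bits and converting a value concrete, here is a minimal sketch. The helper names bitsAsFloat and valueAsFloat are made up for illustration; bitsAsFloat is meant to mirror what __int_as_float does on the device, but is written with memcpy so it can also be tried on the host.

```cuda
#include <string.h>

// Reinterpret the 32 bits of an int as a float. On the device this has the
// same effect as the intrinsic __int_as_float; memcpy is used here so the
// helper (a hypothetical name) can also run on the host.
__host__ __device__ inline float bitsAsFloat(int bits)
{
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Convert the numeric value of an int to a float (an ordinary cast).
__host__ __device__ inline float valueAsFloat(int value)
{
    return (float)value;
}
```

With these, valueAsFloat(7) is 7.0f, while bitsAsFloat(7) is a tiny denormal (roughly 9.8e-45), which is why the first version of the kernel produced nonsense.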
Ok, thank you for your fast help.
Do you know how I can handle negative values?
For example, if floatValue1 is -7 before packing,
I receive 65529 for floatValue1 after unpacking.
Kind regards,
Jens
When you AND an integer with 0xffff, the upper bits are cleared, so the result behaves like an unsigned 16-bit value. Cast it back to a signed value and it should work.
Hi,
do you mean something like this?
float floatVal1 = (int)intVal >> 16;
float floatVal2 = (int)(intVal & 0xffff);
//Do some calculations
// Pack
intVal = ((int)floatVal1) << 16;
intVal = intVal | ((int)floatVal2) & 0xffff;
This doesn’t work in my case.
Kind regards
Jens
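A likely reason the (int) cast changes nothing: after masking with 0xffff the value is already a non-negative int between 0 and 65535, so -7 (low 16 bits 0xfff9) comes back as 65529. One way to restore the sign is to cast the masked value to a 16-bit signed type, which sign-extends it. The sketch below uses made-up helper names (pack16, unpackHigh, unpackLow) and assumes the floats hold whole numbers from -32768 to 32767.

```cuda
// Pack two whole-number floats into one int: a in the high 16 bits, b in the low.
// The work is done on unsigned values so that shifting never touches a sign bit.
__host__ __device__ inline int pack16(float a, float b)
{
    unsigned int hi = ((unsigned int)(int)a & 0xffffu) << 16;
    unsigned int lo = (unsigned int)(int)b & 0xffffu;
    return (int)(hi | lo);
}

// High half: a right shift of a signed int is arithmetic on CUDA and on
// mainstream host compilers, so the sign is preserved.
__host__ __device__ inline float unpackHigh(int packed)
{
    return (float)(packed >> 16);
}

// Low half: the mask alone yields 0..65535; the cast to short maps
// 32768..65535 back to -32768..-1 (two's-complement behaviour).
__host__ __device__ inline float unpackLow(int packed)
{
    return (float)(short)(packed & 0xffff);
}
```

For example, unpackLow(pack16(5.0f, -7.0f)) gives -7.0f, whereas (int)(packed & 0xffff) on the same packed value gives 65529.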
You can multiply the two floats by a constant factor and store the results in a short2 struct, then divide by the same factor when unpacking. The factor to use depends on how many digits you need after the decimal point.
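As a rough sketch of this scaling idea (helper names floatToFixed and fixedToFloat are made up): a short holds -32768 to 32767, so the factor also limits the range. With a factor of 10000 you get four decimal digits but only values between about -3.2768 and 3.2767; for values up to around 32767, as in the original question, the factor would have to be 1. The version below clamps out-of-range values and rounds to nearest instead of truncating; both are choices made for this example.

```cuda
#include <math.h>

// Scale a float and store it as a 16-bit fixed-point value.
// Values that do not fit are clamped to the range of short.
__host__ __device__ inline short floatToFixed(float value, float scale)
{
    float scaled = value * scale;
    scaled = fminf(fmaxf(scaled, -32768.0f), 32767.0f);
    return (short)rintf(scaled);   // round to nearest
}

// Undo the scaling.
__host__ __device__ inline float fixedToFloat(short fixedValue, float scale)
{
    return (float)fixedValue / scale;
}
```

For instance, floatToFixed(1.5f, 10000.0f) stores 15000, and floatToFixed(5.0f, 10000.0f) is clamped to 32767.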
Hey, thanks for your tip,
my code looks like this now:
__global__ void test( short2* gInput )
{
[..]
// Unpack
float floatVal1 = __fdividef( gInput[ index ].x, 10000.f );
float floatVal2 = __fdividef( gInput[ index ].y, 10000.f );
//Do some calculations
// Pack
gInput[ index ].x = (int)( floatVal1 * 10000.f );
gInput[ index ].y = (int)( floatVal2 * 10000.f );
}
This approach works, but it is even a bit slower than using float2 without these conversions.
Do I have any bottlenecks in my code or is it simply not faster to read from short2 instead of float2?
Kind regards,
Jens
float2 should be quite OK for speed. I don’t know exactly, but the coalescing chapter in the manual indicates that float2 isn’t that bad.
Actually, I’d try doing only one read from the short2 global array into a shared short2, and then dividing out of that. The reverse for output. I’d do this optimisation if I were the compiler but who knows …
My guess is that you’d have to be severely bandwidth limited for this to help.
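A sketch of the "one read, one write" idea could look like the following. It copies the short2 into a local variable (a register) rather than into shared memory, which is a simplification of the suggestion above; the parameters n and scale and the helper processPair are assumptions added for the example, and the calculation is left as a pass-through placeholder. Whether this is actually faster is untested here, since the compiler may already merge the two loads.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Unpack, compute, and repack one element (hypothetical helper).
__host__ __device__ inline short2 processPair(short2 in, float scale)
{
    float floatVal1 = (float)in.x / scale;
    float floatVal2 = (float)in.y / scale;

    // ... do some calculations here (placeholder: values pass through) ...

    return make_short2((short)rintf(floatVal1 * scale),
                       (short)rintf(floatVal2 * scale));
}

__global__ void test(short2* gInput, int n, float scale)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;

    short2 packed = gInput[index];                // single 32-bit load
    gInput[index] = processPair(packed, scale);   // single 32-bit store
}
```

A launch such as test<<<(n + 255) / 256, 256>>>(dInput, n, 10000.0f) would process n elements, where dInput is assumed to be a device pointer allocated elsewhere.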