Multiplication gives 24-bit results. How do I get 32?

Related source code (kernel):


char *result_ptr;

unsigned int k[64];
unsigned int i;

    for (i = 0; i <= 1; i++) {
      k[i] = (uint) (4294967296.0 * __sinf (i));
      result_ptr[0] = (char) k[i];
      result_ptr[1] = (char) (k[i] >> 8);
      result_ptr[2] = (char) (k[i] >> 16);
      result_ptr[3] = (char) (k[i] >> 24);
      result_ptr[4] = 0;
    }

host side output:
printf ("result:'0x%02x 0x%02x 0x%02x 0x%02x'\n",
        (unsigned char) result_buf[0], (unsigned char) result_buf[1],
        (unsigned char) result_buf[2], (unsigned char) result_buf[3]);

Output from program:

result:'0x00 0xa4 0x6a 0xd7'

Expected hex value: 0xd76aa478

What should I do to get a 32-bit multiplication result of two floats? Currently it “eats” the least significant bits and I cannot get around it.

I am working on Linux and the GPU is a Quadro FX 5600, if this makes any difference.

EDIT:

Related content from the cudafe2.gpu file:

(((unsigned *)k)[i]) = (__float2uint_rz(((float)((4294967296.0F) * (__sinf(((float)i)))))));

What you ask is impossible because floats only have 24 bits of mantissa. Since you are using a Quadro FX5600, you don’t have support for doubles, which would give you 53 bits of mantissa. Integer multiplication in CUDA has 32 bit precision, but you have to cast both operands to ints in order to get the compiler to do that.

It looks like you are trying to map the sine function to the full int range, which adds another problem since the sinf() function is also single precision. Even with 32 bit multiplication precision, the sinf() return value will only have 24 bits of precision.
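For the integer-multiplication point, a minimal sketch (the kernel and buffer names are just placeholders):

[codebox]
__global__ void
mul_example (const unsigned int *a, const unsigned int *b, unsigned int *out)
{
  unsigned int i = threadIdx.x;

  /* Both operands are 32-bit integers, so the compiler emits a full 32-bit
     multiply (the low 32 bits of the 64-bit product are kept). */
  out[i] = a[i] * b[i];

  /* By contrast, (float) a[i] * (float) b[i] would round both operands and
     the product to 24 significand bits, losing the low-order bits. */
}
[/codebox]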

Thanks! “mantissa” was the word I needed (stupid me).

Meanwhile it seems to calculate mostly OK results from the input, but what I need is 32-bit integer output. Hint to the CUDA developers to introduce a __sini() function :)

Topic closed.

BTW, just curious: can someone come up with a case where solving this problem is necessary and upgrading the video card is not an option?

I do not have a problem with this at the moment because I am just learning, but theoretically it could be a problem. Right now this computation comes from the MD5 algorithm's initialisation routine, which needs to calculate some constants for later use.

MD5 only needs double precision sines to initialize a constant 32 bit integer table of just 64 values. The computation is not data dependent, so it’s constant for every MD5 compute. It’s not difficult to have the host generate them and send them over in constant or global memory just once.
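For reference, a minimal host-side sketch of that approach (the table name d_md5_k and the function name are made up): compute the 64 constants with double-precision sin() on the CPU and copy them to constant memory once.

[codebox]
#include <math.h>
#include <cuda_runtime.h>

__constant__ unsigned int d_md5_k[64];

void init_md5_table (void)
{
  unsigned int k[64];

  /* MD5 defines K[i] = floor(|sin(i + 1)| * 2^32); double precision on the
     host keeps all 32 bits, e.g. k[0] == 0xd76aa478. */
  for (int i = 0; i < 64; i++)
    k[i] = (unsigned int) (4294967296.0 * fabs (sin ((double) (i + 1))));

  cudaMemcpyToSymbol (d_md5_k, k, sizeof (k));
}
[/codebox]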

In general, it would certainly be possible to compute them on older GPUs without double support by using extended-precision tricks, but it would be nontrivial and probably annoying!

G200 GPUs of course can do it all natively with doubles.

I know. But when I read GPU optimisation guides, they often say “recalculate, do not cache” and such. And at the moment I am just in the learning process, so I try everything, even when it is not reasonable. MD5 is dead anyway. I am just thinking, “How would I do it if this were really important to solve?” Currently I am a bit out of ideas. Usually the answer is some really simple and elegant piece of math. Probably it is possible to split the float into halves, compute the partial products separately and join them. Sometimes the answer is easy, so I ask; maybe someone knows this simple trick.

In this case, it would not be so easy with compute capability < 1.3. First you would need to use the “double-single” float representation, which creates a “pseudo-double” out of two single precision floats. The pseudo-double only has 48 bits of mantissa, which is good enough for you here. A standard implementation of double-single arithmetic is provided in the dsfun90 library. You can find the port of many of the dsfun90 functions to CUDA by searching the forum for “dsfun90”.
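For concreteness, here is a sketch of double-single addition and multiplication in the dsfun90 style (adapted from memory from the commonly circulated CUDA port, so treat the details as illustrative). Each number is stored as a high float and a low float whose unevaluated sum is the value; the multiply uses exactly the "split the float into halves and combine partial products" trick mentioned in the previous post.

[codebox]
/* a = (a0, a1), b = (b0, b1); the value is a0 + a1 with |a1| much smaller
   than |a0|, and likewise for b and the result c. */
__device__ void dsadd (float &c0, float &c1,
                       float a0, float a1, float b0, float b1)
{
  /* Knuth's two-sum: t1 is the rounded sum, t2 collects the rounding error
     plus the low-order words. */
  float t1 = a0 + b0;
  float e  = t1 - a0;
  float t2 = ((b0 - e) + (a0 - (t1 - e))) + a1 + b1;

  /* Renormalize so that c0 carries the leading bits and c1 the rest. */
  c0 = e = t1 + t2;
  c1 = t2 - (e - t1);
}

__device__ void dsmul (float &c0, float &c1,
                       float a0, float a1, float b0, float b1)
{
  /* Dekker split of the high words into upper/lower halves (the 8193
     constant splits a 24-bit significand into 12 + 12 bits).  Note: the
     products below must not be contracted into multiply-adds; the real
     port uses __fmul_rn to force plain IEEE multiplies. */
  float cona = a0 * 8193.0f;
  float conb = b0 * 8193.0f;
  float sa1  = cona - (cona - a0);
  float sb1  = conb - (conb - b0);
  float sa2  = a0 - sa1;
  float sb2  = b0 - sb1;

  /* Exact product of the high words: c11 + c21 == a0 * b0. */
  float c11 = a0 * b0;
  float c21 = (((sa1 * sb1 - c11) + sa1 * sb2) + sa2 * sb1) + sa2 * sb2;

  /* Cross terms; only their high-order part is needed. */
  float c2 = a0 * b1 + a1 * b0;

  /* Accumulate and renormalize. */
  float t1 = c11 + c2;
  float e  = t1 - c11;
  float t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a1 * b1;

  c0 = e = t1 + t2;
  c1 = t2 - (e - t1);
}
[/codebox]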

Once you can do basic arithmetic in this double-single representation, then you need to implement a sin() function using these operations. There are many ways to do this, but the most straightforward way would be argument reduction to reduce x to the interval [0, pi/2], followed by Taylor expansion of the function in this region.
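A purely illustrative single-precision sketch of that idea; a real version would replace each float operation with the corresponding double-single routine and do the reduction to [0, pi/2] much more carefully:

[codebox]
__device__ float taylor_sinf (float x)
{
  const float two_pi = 6.283185307f;

  /* Coarse argument reduction into [-pi, pi]; a production version would
     reduce to [0, pi/2] and use symmetry for better accuracy. */
  x = x - two_pi * floorf (x / two_pi + 0.5f);

  /* Taylor series around 0: x - x^3/3! + x^5/5! - ... */
  float term = x;
  float sum  = x;
  for (int n = 1; n <= 7; n++) {
    term *= -x * x / (float) ((2 * n) * (2 * n + 1));
    sum  += term;
  }
  return sum;
}
[/codebox]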

As you might imagine, all of these calculations could take hundreds or thousands of operations per sin() evaluation. This is when “compute, don’t cache” is a bad idea, and you should use constant memory. :)

Now I have a similar problem again, but this time only with unsigned integers. Code which works well on the PC gives a different result on the GPU.

PC version, compile as usual:
[url=“http://no.spam.ee/~tonu/kala.c”]http://no.spam.ee/~tonu/kala.c[/url]
works. Output is:


Calculated value: d98c1dd4 04b2008f 980980e9 7e42f8ec
Reference value:  d41d8cd9 8f00b204 e9800998 ecf8427e

The values are the same, just the endianness is opposite. Now the same thing on the GPU:

[url=“http://no.spam.ee/~tonu/kernel.cu”]http://no.spam.ee/~tonu/kernel.cu[/url] gives real crap.


result:'0xffffca4c 0x3c8bdb47 0x0000d6a0 0x00000000'
Reference value: d41d8cd9 8f00b204 e9800998 ecf8427e

The code is the same. Everything is done with 32-bit integers.

Two questions:

  1. Why?
  2. How can I debug it? I want to printf() after every calculation. Any good advice? Maybe allocating a huge buffer and writing all output there (see the sketch below)?
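For what it is worth, here is a rough sketch of the "huge buffer" idea, since compute 1.x devices have no device-side printf(): every thread writes its intermediate values into its own slice of a global buffer, and the host copies the buffer back and prints it. All names here are made up.

[codebox]
#define DBG_SLOTS 8   /* intermediate values recorded per thread */

__global__ void kernel_dbg (unsigned int *dbg)
{
  /* dbg must be sized gridDim.x * blockDim.x * DBG_SLOTS. */
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int *my = dbg + tid * DBG_SLOTS;

  unsigned int a = 0x67452301;   /* stand-in for a real intermediate value */
  my[0] = a;                     /* record it for host-side inspection */

  unsigned int b = a + threadIdx.x;
  my[1] = b;
}

/* Host side: cudaMalloc() the buffer, launch the kernel, cudaMemcpy() the
   buffer back, then printf() the slots of the thread of interest. */
[/codebox]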

I have been banging my head on this for two days. I think it is time to ask :)

Don’t be shy. Tell me “/&%¤&% you are so stupid, just reread your code again” or “It should work” or…?

NVIDIA staff, please? Is this your bug, or am I being stupid, or…

Why does the same C code produce different results on the GPU and the CPU?

This looks very odd:

[codebox]
__global__ void
kernel (char *charset_ptr, unsigned int charset_len, char *hash_ptr,
        int *result_ptr)
{
  hash_ptr[4] = charset_ptr[threadIdx.x];
  hash_ptr[5] = charset_ptr[blockIdx.x];
[/codebox]

The same memory location is assigned over and over by each and every thread; is that really intentional?
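If it is not intentional, one common fix is to give every (block, thread) pair its own output slot. A rough sketch, with a made-up out_ptr buffer sized for two bytes per thread:

[codebox]
__global__ void
kernel (char *charset_ptr, unsigned int charset_len, char *out_ptr)
{
  /* One two-byte slot per thread instead of a single shared location. */
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

  out_ptr[2 * idx]     = charset_ptr[threadIdx.x];
  out_ptr[2 * idx + 1] = charset_ptr[blockIdx.x];
}
[/codebox]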

Definitely a bug. This is my first CUDA code and I cannot yet think properly about this memory stuff. But this part should not cause different results, should it? Still, it is a good idea to recheck for possible memory corruption.