Multiplication gives 24-bit results. How do I get 32?

Related source code (kernel):


char *result_ptr;

unsigned int k[64];
unsigned int i;

    for (i = 0; i <= 1; i++) {
      k[i] = (uint) (4294967296.0 * __sinf (i));
      result_ptr[0] = (char) k[i];
      result_ptr[1] = (char) (k[i] >> 8);
      result_ptr[2] = (char) (k[i] >> 16);
      result_ptr[3] = (char) (k[i] >> 24);
      result_ptr[4] = 0;
    }

host side output:
printf ("result:'0x%02x 0x%02x 0x%02x 0x%02x'\n",
        (unsigned char) result_buf[0], (unsigned char) result_buf[1],
        (unsigned char) result_buf[2], (unsigned char) result_buf[3]);

Output from program:

result:'0x00 0xa4 0x6a 0xd7'

Expected hex value: 0xd76aa478

What should I do to get a 32-bit multiplication result of two floats? Currently it “eats” the least significant bits and I cannot get around it.

I am working on Linux and the GPU is a Quadro FX 5600, if this makes any difference.

EDIT:

Related content from the cudafe2.gpu file:

(((unsigned *)k)[i]) = (__float2uint_rz(((float)((4294967296.0F) * (__sinf(((float)i)))))));

What you ask is impossible because floats only have 24 bits of mantissa. Since you are using a Quadro FX5600, you don’t have support for doubles, which would give you 53 bits of mantissa. Integer multiplication in CUDA has 32 bit precision, but you have to cast both operands to ints in order to get the compiler to do that.

It looks like you are trying to map the sine function to the full int range, which adds another problem since the sinf() function is also single precision. Even with 32 bit multiplication precision, the sinf() return value will only have 24 bits of precision.
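For the integer-multiplication point, a minimal sketch (the kernel and buffer names are just placeholders):

[codebox]
__global__ void
mul_example (const unsigned int *a, const unsigned int *b, unsigned int *out)
{
  unsigned int i = threadIdx.x;

  /* Both operands are 32-bit integers, so the compiler emits a full 32-bit
     multiply (the low 32 bits of the 64-bit product are kept). */
  out[i] = a[i] * b[i];

  /* By contrast, (float) a[i] * (float) b[i] would round both operands and
     the product to 24 significand bits, losing the low-order bits. */
}
[/codebox]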

Thanks! “mantissa” was the word I needed (stupid me).

Meanwhile it seems to calculate mostly OK results from the input, but what I need is 32-bit integer output. Hint to the CUDA developers to introduce a __sini() function :)

Topic closed.

BTW, just curious: can someone come up with a case where solving this problem is necessary and upgrading the video card is not an option?

I do not have a problem with this at the moment because I am just learning, but theoretically it could be a problem. Right now this computation comes from the MD5 algorithm's initialisation routine, which needs to calculate some constants for later use.

MD5 only needs double precision sines to initialize a constant 32 bit integer table of just 64 values. The computation is not data dependent, so it’s constant for every MD5 compute. It’s not difficult to have the host generate them and send them over in constant or global memory just once.
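For reference, a minimal host-side sketch of that approach (the table name d_md5_k and the function name are made up): compute the 64 constants with double-precision sin() on the CPU and copy them to constant memory once.

[codebox]
#include <math.h>
#include <cuda_runtime.h>

__constant__ unsigned int d_md5_k[64];

void init_md5_table (void)
{
  unsigned int k[64];

  /* MD5 defines K[i] = floor(|sin(i + 1)| * 2^32); double precision on the
     host keeps all 32 bits, e.g. k[0] == 0xd76aa478. */
  for (int i = 0; i < 64; i++)
    k[i] = (unsigned int) (4294967296.0 * fabs (sin ((double) (i + 1))));

  cudaMemcpyToSymbol (d_md5_k, k, sizeof (k));
}
[/codebox]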

In general, it would certainly be possible to compute them on older GPUs without double support by using extended-precision tricks, but it would be nontrivial and probably annoying!

G200 GPUs of course can do it all natively with doubles.

I know. But when I read GPU optimisation guides, they often say “recalculate, do not cache” and such. And at the moment I am just in the learning process, so I try everything, even when it is not reasonable. MD5 is dead anyway. I am just thinking, “How would I do it if this were really important to solve?” Currently I am a bit out of ideas. Usually the answer is some really simple and elegant piece of math. Probably it is possible to split the float into halves, compute the partial products separately and join them. Sometimes the answer is easy, so I ask; maybe someone knows this simple trick.

In this case, it would not be so easy with compute capability < 1.3. First you would need to use the “double-single” float representation, which creates a “pseudo-double” out of two single precision floats. The pseudo-double only has 48 bits of mantissa, which is good enough for you here. A standard implementation of double-single arithmetic is provided in the dsfun90 library. You can find the port of many of the dsfun90 functions to CUDA by searching the forum for “dsfun90”.
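For concreteness, here is a sketch of double-single addition and multiplication in the dsfun90 style (adapted from memory from the commonly circulated CUDA port, so treat the details as illustrative). Each number is stored as a high float and a low float whose unevaluated sum is the value; the multiply uses exactly the "split the float into halves and combine partial products" trick mentioned in the previous post.

[codebox]
/* a = (a0, a1), b = (b0, b1); the value is a0 + a1 with |a1| much smaller
   than |a0|, and likewise for b and the result c. */
__device__ void dsadd (float &c0, float &c1,
                       float a0, float a1, float b0, float b1)
{
  /* Knuth's two-sum: t1 is the rounded sum, t2 collects the rounding error
     plus the low-order words. */
  float t1 = a0 + b0;
  float e  = t1 - a0;
  float t2 = ((b0 - e) + (a0 - (t1 - e))) + a1 + b1;

  /* Renormalize so that c0 carries the leading bits and c1 the rest. */
  c0 = e = t1 + t2;
  c1 = t2 - (e - t1);
}

__device__ void dsmul (float &c0, float &c1,
                       float a0, float a1, float b0, float b1)
{
  /* Dekker split of the high words into upper/lower halves (the 8193
     constant splits a 24-bit significand into 12 + 12 bits).  Note: the
     products below must not be contracted into multiply-adds; the real
     port uses __fmul_rn to force plain IEEE multiplies. */
  float cona = a0 * 8193.0f;
  float conb = b0 * 8193.0f;
  float sa1  = cona - (cona - a0);
  float sb1  = conb - (conb - b0);
  float sa2  = a0 - sa1;
  float sb2  = b0 - sb1;

  /* Exact product of the high words: c11 + c21 == a0 * b0. */
  float c11 = a0 * b0;
  float c21 = (((sa1 * sb1 - c11) + sa1 * sb2) + sa2 * sb1) + sa2 * sb2;

  /* Cross terms; only their high-order part is needed. */
  float c2 = a0 * b1 + a1 * b0;

  /* Accumulate and renormalize. */
  float t1 = c11 + c2;
  float e  = t1 - c11;
  float t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a1 * b1;

  c0 = e = t1 + t2;
  c1 = t2 - (e - t1);
}
[/codebox]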

Once you can do basic arithmetic in this double-single representation, then you need to implement a sin() function using these operations. There are many ways to do this, but the most straightforward way would be argument reduction to reduce x to the interval [0, pi/2], followed by Taylor expansion of the function in this region.
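A purely illustrative single-precision sketch of that idea; a real version would replace each float operation with the corresponding double-single routine and do the reduction to [0, pi/2] much more carefully:

[codebox]
__device__ float taylor_sinf (float x)
{
  const float two_pi = 6.283185307f;

  /* Coarse argument reduction into [-pi, pi]; a production version would
     reduce to [0, pi/2] and use symmetry for better accuracy. */
  x = x - two_pi * floorf (x / two_pi + 0.5f);

  /* Taylor series around 0: x - x^3/3! + x^5/5! - ... */
  float term = x;
  float sum  = x;
  for (int n = 1; n <= 7; n++) {
    term *= -x * x / (float) ((2 * n) * (2 * n + 1));
    sum  += term;
  }
  return sum;
}
[/codebox]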

As you might imagine, all of these calculations could take hundreds or thousands of operations per sin() evaluation. This is when “compute, don’t cache” is a bad idea, and you should use constant memory. :)

Now I have a similar problem again, but this time only with unsigned integers. Code which works well on the PC gives a different result on the GPU.

PC version, compile as usual:
[url=“http://no.spam.ee/~tonu/kala.c”]http://no.spam.ee/~tonu/kala.c[/url]
works. Output is:


Calculated value: d98c1dd4 04b2008f 980980e9 7e42f8ec
Reference value:  d41d8cd9 8f00b204 e9800998 ecf8427e

The values are the same, just the endianness is opposite. Now the same thing on the GPU:

[url=“http://no.spam.ee/~tonu/kernel.cu”]http://no.spam.ee/~tonu/kernel.cu[/url] gives real crap.


result:'0xffffca4c 0x3c8bdb47 0x0000d6a0 0x00000000'
Reference value: d41d8cd9 8f00b204 e9800998 ecf8427e

The code is the same. Everything is done with 32-bit integers.

Two questions:

  1. Why?
  2. How can I debug it? I want to printf() after every calculation. Any good advice? Maybe allocating a huge buffer and writing all output there (see the sketch below)?
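For what it is worth, here is a rough sketch of the "huge buffer" idea, since compute 1.x devices have no device-side printf(): every thread writes its intermediate values into its own slice of a global buffer, and the host copies the buffer back and prints it. All names here are made up.

[codebox]
#define DBG_SLOTS 8   /* intermediate values recorded per thread */

__global__ void kernel_dbg (unsigned int *dbg)
{
  /* dbg must be sized gridDim.x * blockDim.x * DBG_SLOTS. */
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int *my = dbg + tid * DBG_SLOTS;

  unsigned int a = 0x67452301;   /* stand-in for a real intermediate value */
  my[0] = a;                     /* record it for host-side inspection */

  unsigned int b = a + threadIdx.x;
  my[1] = b;
}

/* Host side: cudaMalloc() the buffer, launch the kernel, cudaMemcpy() the
   buffer back, then printf() the slots of the thread of interest. */
[/codebox]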

I have been banging my head on this for two days. I think it is time to ask :)

Don’t be shy. Tell me “/&%¤&% you are so stupid, just reread your code again” or “It should work” or…?

NVIDIA staff, please? Is this your bug, or am I being stupid, or…

Why does the same C code produce different results on the GPU and the CPU?

This looks very odd:

[codebox]
__global__ void
kernel (char *charset_ptr, unsigned int charset_len, char *hash_ptr,
        int *result_ptr)
{
  hash_ptr[4] = charset_ptr[threadIdx.x];
  hash_ptr[5] = charset_ptr[blockIdx.x];
[/codebox]

The same memory location is assigned over and over by each and every thread; is that really intentional?
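If it is not intentional, one common fix is to give every (block, thread) pair its own output slot. A rough sketch, with a made-up out_ptr buffer sized for two bytes per thread:

[codebox]
__global__ void
kernel (char *charset_ptr, unsigned int charset_len, char *out_ptr)
{
  /* One two-byte slot per thread instead of a single shared location. */
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

  out_ptr[2 * idx]     = charset_ptr[threadIdx.x];
  out_ptr[2 * idx + 1] = charset_ptr[blockIdx.x];
}
[/codebox]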

Definitely a bug. This is my first CUDA code and I cannot yet think properly about this memory stuff. But this part should not cause different results, should it? Still, it is a good idea to recheck for possible memory corruption.