What you ask is impossible because floats only have 24 bits of mantissa. Since you are using a Quadro FX5600, you don’t have support for doubles, which would give you 53 bits of mantissa. Integer multiplication in CUDA has 32 bit precision, but you have to cast both operands to ints in order to get the compiler to do that.
It looks like you are trying to map the sine function to the full int range, which adds another problem since the sinf() function is also single precision. Even with 32 bit multiplication precision, the sinf() return value will only have 24 bits of precision.
Thanks! “mantissa” was the word I needed (stupid me).
Meanwhile it seems to calculate mostly OK values from the input, but what I need is 32-bit integer output. A hint to the CUDA developers: introduce a __sini() function :)
BTW, just curious: can someone come up with an idea for when a solution to this problem is necessary and upgrading the video card is not an option?
I do not have a problem with this at the moment because I am just learning, but theoretically it could be a problem. At the moment this computation comes from the MD5 algorithm's initialisation routine. It needs to calculate some constants for later use.
MD5 only needs double precision sines to initialize a constant 32 bit integer table of just 64 values. The computation is not data dependent, so it’s constant for every MD5 compute. It’s not difficult to have the host generate them and send them over in constant or global memory just once.
In general, it would certainly be possible, though probably annoying, to compute them on older GPUs without double support by using extended precision tricks, but it would be nontrivial!
G200 GPUs of course can do it all natively with doubles.
I know. But when I read GPU optimisation guides, they often say “recalculate, do not cache” and such. And at the moment I am just in the learning process, so I try everything even when it is not reasonable. MD5 is dead anyway. I am just thinking “How would I do it if this were really important to solve?”. Currently I am a bit out of ideas. Usually the answer is some really simple and ingenious math. Probably it is possible to split the float in half, compute the products separately and join them. Sometimes the answer is easy, so I ask; maybe someone knows this simple trick.
In this case, it would not be so easy with compute capability < 1.3. First you would need to use the “double-single” float representation, which creates a “pseudo-double” out of two single precision floats. The pseudo-double only has 48 bits of mantissa, which is good enough for you here. A standard implementation of double-single arithmetic is provided in the dsfun90 library. You can find ports of many of the dsfun90 functions to CUDA by searching the forum for “dsfun90”.
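To illustrate just the representation (not the dsfun90 arithmetic itself), here is a small host-side C sketch. The type and function names are hypothetical, and it uses a host double for the split purely for demonstration; on a GPU without doubles the two halves would have to be produced by float-only routines such as those in dsfun90.

```c
#include <math.h>

/* Hypothetical double-single value: an unevaluated sum hi + lo,
   where hi holds the leading 24 bits and lo the next 24 or so. */
typedef struct {
    float hi;
    float lo;
} dsfloat;

/* Split a double into two floats (illustration only: uses double). */
dsfloat ds_from_double(double x)
{
    dsfloat r;
    r.hi = (float)x;                   /* nearest float to x          */
    r.lo = (float)(x - (double)r.hi);  /* what the first float missed */
    return r;
}

/* Recombine the pair to inspect how much precision it carries. */
double ds_to_double(dsfloat v)
{
    return (double)v.hi + (double)v.lo;
}
```

Round-tripping sin(1.0) through this pair loses far less than a single float does, which is the extra precision the double-single routines work to preserve through each operation.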
Once you can do basic arithmetic in this double-single representation, then you need to implement a sin() function using these operations. There are many ways to do this, but the most straightforward way would be argument reduction to reduce x to the interval [0, pi/2], followed by Taylor expansion of the function in this region.
As you might imagine, all of these calculations could take hundreds or thousands of operations per sin() evaluation. This is when “compute, don’t cache” is a bad idea, and you should use constant memory. :)
Definitely a bug. This is my first CUDA code and I cannot yet think properly about this memory stuff. But this part should not result in different code, should it? Still, it is an idea to recheck for possible memory corruption.