__fmul_rz 2x3 != 6 !!??

Hi All,

I found something strange in __fmul_rz.

I expect it to return the product of two floating-point numbers.

In my device code, I tried

float temp = __fmul_rz(2.0f, 3.0f);

It always returns 4 for me…

If I do

float temp = __fmul_rz(3.0f, 2.0f);

It always returns 9 for me…

Is it a bug, or am I using it wrong?


This is a really interesting one.

See the results:

__fmul_rz(1.0f, 0.0f);     //1

__fmul_rz(2.0f, 0.0f);     //4

__fmul_rz(3.0f, 0.0f);     //9

__fmul_rz(4.0f, 0.0f);     //16

__fmul_rz(10.0f, 0.0f);    //100

__fmul_rz(15.0f, 0.0f);    //225

__fmul_rz(4.5f, 0.0f);     //20.25

__fmul_rz(33.7f, 0.0f);    //1135.689941

It is not even considering the second parameter; it is returning exactly the square of the first parameter.

I hope someone can explain what the behaviour of __fmul_rz is supposed to be.

Well, __fmul_rz rounds toward zero. Check the value of 2.0f*3.0f without any rounding; it is probably something like 5.9999999999999, which will round to 4 when rounding toward zero.

That would be 5.0, not 4.0.

If you look at the results, it returns a*a, not a*b.

The b factor is ignored every time.

I have not used that function, but if these results are right, it looks like it's just plain bugged.

OK, I’m obviously very awake this morning…

Post a simple test code that exhibits this behavior (a full application that can be compiled and run), and give an example of the output and expected output. Then others can see if they can reproduce it and someone from NVIDIA can file it as a bug report.

What OS, CUDA version?
Do you have a test case?

I’m using Windows XP Home Edition SP 2, CUDA 1.1.

My card is GeForce 8800 GTX.

static __global__ void test(float *out, int N)
{
    unsigned int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (idx < N) out[idx] = __fmul_rz(2.0f, 3.0f);
}

int main(int argc, char** argv)
{
    int N = 10;
    float *out;
    cudaMalloc((void **) &out, sizeof(float) * N);

    dim3 dimBlock(3, 1, 1);
    dim3 dimGrid((N / dimBlock.x) + (!(N % dimBlock.x) ? 0 : 1), 1, 1);

    test<<<dimGrid, dimBlock>>>(out, N);

    float r[10];
    cudaMemcpy(r, out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    // every element of r comes back as 4; 6 is expected

    cudaFree(out);
    return 0;
}

Thanks for looking into it.

We were able to reproduce the bug in CUDA 1.1, but it has been fixed in CUDA 2.0.
Could you try to upgrade your system?

Bug gone in CUDA 2.0! :geek:

Thanks a lot. [:clap: ]