__fmul_rz 2x3 != 6 !!??

Hi All,

I found something strange in __fmul_rz.

I expect it to return the product of two floating-point numbers.

In my device code, I tried

float temp = __fmul_rz(2.0f, 3.0f);

It always returns 4 for me…

If I do

float temp = __fmul_rz(3.0f, 2.0f);

It always returns 9 for me…

Is it a bug, or am I using it wrong?


This is a really interesting one.

See the results:

__fmul_rz(1.0f, 0.0f);     //1

__fmul_rz(2.0f, 0.0f);     //4

__fmul_rz(3.0f, 0.0f);     //9

__fmul_rz(4.0f, 0.0f);     //16

__fmul_rz(10.0f, 0.0f);    //100

__fmul_rz(15.0f, 0.0f);    //225

__fmul_rz(4.5f, 0.0f);     //20.25

__fmul_rz(33.7f, 0.0f);    //1135.689941

It is not even considering the second parameter; it is returning exactly the square of the first parameter.

I hope someone can explain what the behaviour of __fmul_rz is supposed to be.

Well, __fmul_rz rounds toward zero. Check the value of 2.0f*3.0f without any rounding; it is probably something like 5.9999999999999, which will round to 4 when rounding toward zero.

That would be 5.0, not 4.0.

If you look at the results, it returns a*a, not a*b.

The b factor is ignored every time.

I have not used that function, but if these results are right, it looks like it's just plain bugged.

OK, I’m obviously very awake this morning…

Post a simple test code that exhibits this behavior (a full application that can be compiled and run), and give an example of the output and expected output. Then others can see if they can reproduce it and someone from NVIDIA can file it as a bug report.

What OS, CUDA version?
Do you have a test case?

I’m using Windows XP Home Edition SP 2, CUDA 1.1.

My card is GeForce 8800 GTX.

static __global__ void test(float *out, int N)
{
    unsigned int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (idx < N) out[idx] = __fmul_rz(2.0f, 3.0f);
}

int main(int argc, char** argv)
{
    int N = 10;
    float *out;
    cudaMalloc((void **) &out, sizeof(float) * N);

    dim3 dimBlock(3, 1, 1);
    dim3 dimGrid((N / dimBlock.x) + (!(N % dimBlock.x) ? 0 : 1), 1, 1);

    test<<<dimGrid, dimBlock>>>(out, N);

    float r[10];
    cudaMemcpy(r, out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    // every element of r comes back as 4; 6 is expected

    cudaFree(out);
    return 0;
}

Thanks for looking into it.

We were able to reproduce the bug in CUDA 1.1, but it has been fixed in CUDA 2.0.
Could you try to upgrade your system?

Bug gone in CUDA 2.0! :geek:

Thanks a lot. [:clap: ]