__float_as_int compiler bug

Hello. I have to port CPU code that uses SSE to CUDA.

So I decided to write some _mm_dosomething functions,

but I got stuck on this.

This function works fine under device emulation, but on the GPU every result is zero.

__device__ vec4f _mm_and_ps(const vec4f& a, const vec4f& b)
{
	vec4f res;
	res[0] = __int_as_float(__float_as_int(a[0]) & __float_as_int(b[0]));
	res[1] = __int_as_float(__float_as_int(a[1]) & __float_as_int(b[1]));
	res[2] = __int_as_float(__float_as_int(a[2]) & __float_as_int(b[2]));
	res[3] = __int_as_float(__float_as_int(a[3]) & __float_as_int(b[3]));
	return res;
}

(vec4f is something like float4)

I use CUDA 1.1; maybe this error also occurs on CUDA 2.0.

My questions:

  1. Is this a compiler bug?

  2. Is there any way to write a correct _mm_and_ps function for the GPU?

Maybe I can use PTX assembler?

Thanks in advance.

Try using decuda. You can see the exact hardware instructions emitted and whether the compiler is doing something wrong.

Btw, what’s vec4f? A class with an overloaded operator[]? Can you post its source code?

Thanks for the answer. I think decuda is a good idea. I had a problem with decuda, but I will try it again.

You're right about vec4f; it's MGML_MATH::VECTOR<4,float> from the MGML_MATH library.

template<int n, class T>
class VECTOR : ...
{
	inline universal_call T operator[](int i) const
	{ ASSERT(i<n); return this->M[i]; }

	inline universal_call T& operator[](int i)
	{ ASSERT(i<n); return this->M[i]; }
};

#define universal_call __device__ __host__

#ifdef __CUDACC__

	#undef ASSERT
	#define ASSERT(_expression) \
	if(!(_expression)) { fprintf(stderr,"Assertion failed. File: %s, Line %d\n",__FILE__,__LINE__); \
	abort(); }

#endif



Btw, my previous code should give a compiler error, because the ASSERT calls host functions (fprintf, abort) from device code. Am I right?

You can see the full code of vec4f here; it's large and complex due to template metaprogramming: http://ray-tracing.ru/upload/free/CERF/CER…MPLE_alpha2.rar (in MGML_VECTOR.h)

But this function

inline __device__ float4 _mm_and_ps2(const float4& a, const float4& b)
{
	float4 r;
	r.x = __int_as_float(__float_as_int(a.x) & __float_as_int(b.x));
	r.y = __int_as_float(__float_as_int(a.y) & __float_as_int(b.y));
	r.z = __int_as_float(__float_as_int(a.z) & __float_as_int(b.z));
	r.w = __int_as_float(__float_as_int(a.w) & __float_as_int(b.w));
	return r;
}

doesn’t work either :(

Yeah… dunno. Clearly a bug. Try it with CUDA 2.0. Try it with call-by-value (i.e., don’t use const&). Try it with *(int*)&a.x instead of __float_as_int(). Try it with plain floats, not float4s.

It sounds as if you generate denormals that get flushed to zero, or NaNs that get normalized.

Are the inputs really valid floats or rather integers pretending to be floats?

There are floats of both kinds: real floats and 0xFFFFFFFF mask values.

I made some simple tests with plain float and float4, and they passed. I have found that this problem occurs only when I use my vec4f, even if I convert vec4f to float4 and vice versa (that's really strange). I don't know why, because my vec4f has never given me any problems with operator[] or anything else.

I have found a stable function:

inline __device__ vec4f _mm_and_ps(const vec4f& a, const uint4& b)
{
	vec4f r;
	r.M[0] = __int_as_float(__float_as_int(a.M[0]) & b.x);
	r.M[1] = __int_as_float(__float_as_int(a.M[1]) & b.y);
	r.M[2] = __int_as_float(__float_as_int(a.M[2]) & b.z);
	r.M[3] = __int_as_float(__float_as_int(a.M[3]) & b.w);
	return r;
}

That's not the same signature, but it's fine for me.

Anyway, thanks for the help. I think I will post the PTX in this topic later, because I want to know why this error occurred.

Yes, do use decuda.

Did you end up trying CUDA 2.0? You should really be using it now that it’s final. It still has bugs, but a good deal fewer than 1.1.

Finally, I have found the cause of this error. It was operator[]. But it is very strange, because this is the first time I have had problems with operator[]. And it is interesting that I cannot reproduce this error with a simple structure. It appears only with my MGML_MATH::VECTOR<4,float> :blink:

So this tricky function gives bugs:

inline __device__ vec4f make_vec4f(float x, float y, float z, float w, vec4f& v)
{
	v[0] = x;
	v[1] = y;
	v[2] = z;
	v[3] = w;
	return v;
}
but this function works:

inline __device__ vec4f make_vec4f(float x, float y, float z, float w, vec4f& v)
{
	v.M[0] = x;
	v.M[1] = y;
	v.M[2] = z;
	v.M[3] = w;
	return v;
}

In the following code I used them both and got different PTX assembler:

__global__ void test_kernel2(vec4f* in_a, vec4f* in_b, vec4f* out_result)
{
	uint x = blockDim.x * blockIdx.x + threadIdx.x;
	uint y = blockDim.y * blockIdx.y + threadIdx.y;
	uint w = 8*4;
	uint tid  = x + w*y;

	vec4f a = in_a[tid];
	vec4f b = in_b[tid];

	vec4f mask = make_vec4f(__int_as_float(0xffffffff),
	                        ...);

	vec4f result = _mm_and_ps(a,mask);
	out_result[tid] = result;
}
Correct PTX (seems to be optimized, but it works):

ld.param.u32  $r10, [__cudaparm__Z12test_kernel2PN9MGML_MATH6VECTORILi4EfEES2_S2__in_a];


	add.u32  $r11, $r10, $r9;
	ld.global.f32  $f1, [$r11+8];  //  id:93
	.loc	24	435	0
	ld.global.s32  $r12, [$r11+0]; //  id:94
	mov.b32  $f2, $r12;
	mov.f32  $f3, $f2;
	mov.f32  $f4, 0f00000000;      //  0
	mov.f32  $f5, $f4;
	mov.b32  $r13, $f1;
	mov.b32  $f6, $r13;
	mov.f32  $f7, $f6;
	.loc	2	234	0

ld.param.u32  $r14, [__cudaparm__Z12test_kernel2PN9MGML_MATH6VECTORILi4EfEES2_S2__out_result];

Incorrect PTX:

ld.param.u32  $r10, [__cudaparm__Z12test_kernel2PN9MGML_MATH6VECTORILi4EfEES2_S2__in_a];
	add.u32  $r11, $r10, $r9;
	ld.global.f32  $f1, [$r11+0];  //  id:109
	ld.global.f32  $f2, [$r11+4];  //  id:110
	ld.global.f32  $f3, [$r11+8];  //  id:111
	.loc	24	423	0
	mov.f32  $f4, 0fffffffff;      //  -1.#QNAN
	mov.f32  $f5, $f4;
	mov.f32  $f6, 0f00000000;      //  0
	mov.f32  $f7, $f6;
	mov.f32  $f8, 0fffffffff;      //  -1.#QNAN
	mov.f32  $f9, $f8;
	mov.f32  $f10, 0f00000000;     //  0
	mov.f32  $f11, $f10;
	.loc	24	435	0
	mov.b32  $r12, $f1;
	cvt.rzi.s32.f32  $r13, $f5;
	and.b32  $r14, $r12, $r13;
	mov.b32  $f12, $r14;
	mov.f32  $f13, $f12;
	mov.b32  $r15, $f2;
	cvt.rzi.s32.f32  $r16, $f7;
	and.b32  $r17, $r15, $r16;
	mov.b32  $f14, $r17;
	mov.f32  $f15, $f14;
	mov.b32  $r18, $f3;
	cvt.rzi.s32.f32  $r19, $f9;
	and.b32  $r20, $r18, $r19;
	mov.b32  $f16, $r20;
	mov.f32  $f17, $f16;
	.loc	2	234	0
ld.param.u32  $r21, [__cudaparm__Z12test_kernel2PN9MGML_MATH6VECTORILi4EfEES2_S2__out_result];

But I cannot reproduce that error with a simple structure with operator[], so it seems one can get it only with MGML_MATH.

Anyway, thanks for the help. I think it is better not to use C++ features as heavily as I do.

What about in post #3 when you said a simple float4 didn’t work either?