is there an __h2min in cuda for half2 data?

wondering if this can be done easily in cuda (without much overhead).

trying to speed up a kernel by converting float to half2 operations. right now, I want to convert the following fp32 line to half2:

dist=fminf(h.x, fminf(h.y, h.z));

where h is a float4. we’ve converted h to two half2 data structures, h1 and h2, but are having difficulty finding an intrinsic that extracts the min of the first 3 elements (or all 4 if I set h.w to inf).
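for reference, the packing step I have in mind looks roughly like this (a sketch, untested, assuming sm_53+; __floats2half2_rn() is from cuda_fp16.h and CUDART_INF_F from math_constants.h — h.w is replaced by +inf so it can never win the min):

```cuda
#include <cuda_fp16.h>
#include <math_constants.h>

// Pack the float4 h into two half2 values; the 4th lane is forced
// to +inf so it is never the minimum.
__half2 h1 = __floats2half2_rn(h.x, h.y);
__half2 h2 = __floats2half2_rn(h.z, CUDART_INF_F);
```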

your suggestion is appreciated.

Curiously enough, I cannot see any min/max intrinsics for half2 operands in the documentation:

http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__HALF.html#group__CUDA__MATH__INTRINSIC__HALF
http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html#group__CUDA__MATH____HALF2__ARITHMETIC

You could build your own using the comparison intrinsics that are supported. You could also double check the header files in case there was an inadvertent omission in the documentation. Lastly, you could file an enhancement request with NVIDIA to add this functionality.

thanks njuffa. for comparing two half2s, I can use the following

half2 h1, h2;

// __hlt2/__hle2 return 1.0 in each lane where the comparison holds (0.0 otherwise),
// so this computes the element-wise min: h1 = (h1<h2)*h1 + (h2<=h1)*h2
h1 = __hfma2(__hlt2(h1, h2), h1, __hmul2(__hle2(h2, h1), h2));

now the lowest two numbers are saved in h1, but the issue is that cuda does not seem to have a horizontal min operator (by comparison, a horizontal min is possible with SSE on the CPU using _mm_min_ps and _mm_shuffle_ps).
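for what it’s worth, a horizontal min can be pieced together from the scalar half intrinsics (a sketch, untested, assuming sm_53+ where __low2half(), __high2half() and __hlt() are available):

```cuda
#include <cuda_fp16.h>

// Horizontal min of the two lanes of a half2, built from scalar intrinsics.
__device__ __half h2min_horiz(__half2 h)
{
    __half lo = __low2half(h);    // lane 0
    __half hi = __high2half(h);   // lane 1
    return __hlt(lo, hi) ? lo : hi;
}
```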

in any case, I still don’t see how I can use half2 to be faster than the floating-point version (i.e. dist=fminf(h.x, fminf(h.y, h.z))).

any other options to do this?

There is no such horizontal operation to my knowledge. You could build your own equivalent of _mm_shuffle_ps() by rearranging the bytes with __byte_perm(), but I am not sure how best to transfer the ‘half2’ to an ‘int’ variable and back. For ‘float’ there is __int_as_float() and __float_as_int(); is there something like that for ‘half2’? If not, one could try inline PTX, but I don’t know how to bind a ‘half2’ variable. Maybe use reinterpret_cast?
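to illustrate, here is a sketch (untested) of an _mm_shuffle_ps()-style lane swap built on __byte_perm(), using memcpy() for the bit-wise ‘half2’ <-> ‘int’ transfer:

```cuda
#include <cuda_fp16.h>
#include <string.h>

// Swap the two 16-bit lanes of a half2. The selector 0x1032 tells
// __byte_perm() to emit bytes 2,3,0,1 of the source word.
__device__ __half2 h2_swap(__half2 h)
{
    unsigned int u;
    memcpy(&u, &h, sizeof(u));       // bit-wise half2 -> int
    u = __byte_perm(u, 0u, 0x1032);
    memcpy(&h, &u, sizeof(h));       // bit-wise int -> half2
    return h;
}
```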

Either we are missing something here, or NVIDIA needs to add some more functionality for efficiently dealing with ‘half2’. You could file enhancement requests with NVIDIA to request the primitives you need for your use case.

__byte_perm looks useful. can I do

*((int *)&h1)

to cast it to an integer? for example, I was hoping to use

*((int *)&h1) &= 0x7FFF7FFF;

to take the absolute value of a half2. will this work?

Type punning through a pointer cast is not safe (it invokes undefined behavior according to the C++ standard). That is why CUDA provides bit-wise transfer intrinsics from and to ‘float’ and ‘double’. Check whether CUDA 9 defines any such transfer intrinsics for ‘half2’.

The standard safe idiom for bit-wise transfer in C++ is memcpy() which modern compilers optimize as much as possible (usually resulting in a register->register assignment for cases like this one). Worth a try, but I am not very hopeful the CUDA compiler implements that optimization.
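as a concrete example, the absolute-value trick from above with memcpy() in place of the pointer cast would look something like this (a sketch, untested):

```cuda
#include <cuda_fp16.h>
#include <string.h>

// half2 absolute value: clear the sign bit of each 16-bit lane.
__device__ __half2 h2abs(__half2 h)
{
    unsigned int u;
    memcpy(&u, &h, sizeof(u));
    u &= 0x7FFF7FFFu;               // drop both sign bits
    memcpy(&h, &u, sizeof(h));
    return h;
}
```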

The classical safe C idiom for type punning is the use of a union, you could give that a try. Not sure what the CUDA compiler will make of it. And since it officially invokes undefined behavior in C++ as well, even if it works today it may be broken with future compiler versions.

thanks, I am going to try the union way then. I was afraid of using a union because I believe it will eat more of my registers (my kernel is already register hungry), but I never verified whether this is true.

For example, I have this function in my code

__device__ inline float mcx_nextafterf(float a, int dir){
      union{
          float f;
          uint  i;
      } num;
      num.f=a+gcfg->maxvoidstep;
      num.i+=dir ^ (num.i & 0x80000000U);
      return num.f-gcfg->maxvoidstep;
}

does that mean I will consume one more register inside this function? how likely is it that CUDA will optimize this register out?

A union means two or more entities share the same storage, so in this case num.f and num.i refer to the same data object, interpreted in two different ways: as a ‘float’ and as a ‘uint’. In all likelihood ‘num’ will be assigned to a register, and that register will be used as an operand of integer instructions and floating-point instructions.

I managed to code the half-precision version of my kernel (just the hotspot portion); the changes can be found here

unfortunately, I am getting an 18% speed drop on a P100. I posted this finding in a new thread:

https://devtalk.nvidia.com/default/topic/1026669/cuda-programming-and-performance/converting-fp32-math-to-fp16-fails-to-give-speed-up/