Hello, I was looking for min, max and sign functions in CUDA, but I found nothing. I am using very basic macro definitions for them and I don’t like that much, because they could lead to warp divergence.

Now with Fermi there is something called predication, which I don’t understand well, that supposedly takes care of such small divergence. What I understood is that it only removes the branching overhead: the threads in the warp execute both branches, just without an actual branch. That is still different from a single function call that involves no branching at all (or is done in hardware, like addition).

So what I am asking is: do functions for min, max and sign exist in CUDA that do not involve branching, or that are implemented in the most efficient way possible? If so, how can I use them, and which header file are they in?

I can imagine a function for sign that could be implemented in hardware; it could just return the sign bit or something. I am not so sure about min and max, but regardless, if CUDA implements some version of min, max or sign, I would like to know.
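To make the sign idea concrete, here is a minimal sketch of two ways it is often written without an if statement. The helper names `sign3` and `sign_of_bit` and the `HD` macro are made up for illustration and are not CUDA functions; `copysignf` is the standard C99 function. Whether the compiler emits a branch-free sequence for the first variant is an assumption that would need checking in the generated code.

```cpp
#include <math.h>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: three-way sign, returns -1, 0 or +1.
// Each comparison yields 0 or 1, so the source contains no if/else;
// a NaN input gives 0 because both comparisons are false.
template <typename T>
HD inline int sign3(T x)
{
    return (T(0) < x) - (x < T(0));
}

// Hypothetical helper: +1.0f or -1.0f taken from the sign bit alone,
// so -0.0f maps to -1.0f.
HD inline float sign_of_bit(float x)
{
    return copysignf(1.0f, x);
}
```

The two variants differ at zero: `sign3(0.0f)` is 0, while `sign_of_bit(0.0f)` is 1.0f, so which one fits depends on what "sign" should mean for the macro being replaced.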

In general, CUDA supports the full set of C99 standard math functions, plus various common extras (e.g. sincos, exp10, rsqrt, j0, j1, jn, y0, y1, yn). The online help does not seem to mention overloaded functions like min() and max(), which CUDA supports for just about any scalar type.

The GPU hardware supports integer and floating-point min / max operations directly via dedicated instructions. The handling of special operands in the floating-point variants follows the IEEE 754-2008 standard. In particular, a min / max operation in which exactly one operand is a NaN returns the non-NaN operand; this sometimes comes as a surprise to programmers.
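The NaN rule is easiest to see next to a comparison-based min. The sketch below is only an illustration: `clamp_fminmax`, `min_by_compare` and the `HD` macro are hypothetical names, while `fminf` and `fmaxf` are the standard functions discussed here.

```cpp
#include <math.h>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: clamp x to [lo, hi] with the library min/max.
// If x is NaN, fmaxf(x, lo) returns lo, so the result is lo, not NaN.
HD inline float clamp_fminmax(float x, float lo, float hi)
{
    return fminf(fmaxf(x, lo), hi);
}

// Hypothetical helper: comparison-based min, shown for contrast.
// Any comparison with NaN is false, so the result depends on argument order:
// (NaN, 1) gives 1, but (1, NaN) gives NaN.
HD inline float min_by_compare(float a, float b)
{
    return (a < b) ? a : b;
}
```

So code that relies on a NaN propagating through a min or max, for example to flag bad input, would need an explicit NaN check when it switches from a macro like the second helper to `fminf` / `fmaxf`.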

Now I see that there are different functions for single-precision and double-precision min/max: fminf and fmaxf for float, fmin and fmax for double. I am using a typedef real which can be either a float or a double, and I want to apply max and min to variables of type real, so which function should I use? If I use the double-precision function and my real is a float, then perhaps the compiler can cast the float to a double, but then the double-precision operation is sure to take more cycles. If I use the single-precision function and my real is a double, I would lose precision.

I could be wrong with this analysis, and using the double-precision function might cost just as much as the single-precision one. But if I’m not, how can I pick the right function to use? I could use macros, but I would like to avoid that if possible, or use if statements, which kind of defeats the purpose. Any suggestions?
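For the real typedef described above, one way to avoid both macros at the call site and if statements is to let C++ overload resolution pick the precision. This is a minimal sketch, not an established API: `real_min`, `real_max`, `clamp_real`, the `HD` macro and the `USE_DOUBLE` switch are all hypothetical names chosen for illustration.

```cpp
#include <math.h>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed build switch: define USE_DOUBLE to make real a double.
#ifdef USE_DOUBLE
typedef double real;
#else
typedef float real;
#endif

// Hypothetical wrappers: the argument types select the variant at compile
// time, so a float never goes through the double-precision function and a
// double is never narrowed to float.
HD inline float  real_min(float a, float b)   { return fminf(a, b); }
HD inline double real_min(double a, double b) { return fmin(a, b); }
HD inline float  real_max(float a, float b)   { return fmaxf(a, b); }
HD inline double real_max(double a, double b) { return fmax(a, b); }

// Example use: limits v to [lo, hi] in whatever precision real has.
HD inline real clamp_real(real v, real lo, real hi)
{
    return real_min(real_max(v, lo), hi);
}
```

The selection happens entirely at compile time, so there is no runtime test on the type and nothing that could diverge within a warp.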

Thanks to overloading, you can use the generic function name with both float and double arguments. I do not offhand know where the following is documented these days, but when double-precision support was first added to CUDA it was mentioned in the release notes: