Are max(a, b) and min(a, b) divergent?

It's a simple question, but one that I don't see answered anywhere in the programming guide.

The standard CPU implementation seems to be:

(b<a) ? a : b;

which is clearly divergent, but I’d like to know if CUDA does anything clever to get around it.

Also when doing something like

a = max(a,0);

will the compiler reduce that to

a *= (a>0)

to prevent divergence (assuming that max is divergent in the first place)?

EDIT: I particularly care about when a and b are floats, but a more general answer may be helpful for others. I hope there is someone who knows!

The “standard” C source-level idioms to compute min() and max() do not usually result in divergent code for most scalar types, since the compiler translates them into predicated execution or select instructions. You can use cuobjdump to disassemble the machine code if you need to know for sure.

The min() / max() functionality is directly supported by hardware instructions for many scalar types (e.g. int, float, double) on both sm_1x and sm_2x devices. The various min/max instructions will be readily noticeable in disassembled machine code by their names.
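As a concrete illustration (a minimal sketch; the exact machine code depends on your toolchain and target architecture), a kernel like the one below, once compiled and disassembled with cuobjdump, should show a single min/max instruction such as FMNMX rather than a branch:

__global__ void clampKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // built-in fmaxf() typically maps straight to a hardware max instruction
        data[i] = fmaxf(data[i], 0.0f);
    }
}

To build and inspect the machine code, something like:

nvcc -arch=sm_20 -cubin clamp.cu
cuobjdump -sass clamp.cubin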

Thank you for the quick response!

You are welcome. I notice belatedly that I should have been clearer in the second paragraph. By “min() / max() functionality” I was referring to the overloaded min() and max() functions that CUDA makes available in device code, as opposed to discrete source-level constructs such as macros that accomplish the same thing. The functions are what gets translated into min/max instructions for common scalar types; the discrete constructs will typically map to predicated execution or select instructions for common scalar types.

The overall message is that programmers should not worry about divergence due to min/max computations. In general the direct use of built-in min() and max() functions will result in somewhat higher performance than the use of discrete equivalents, as the dynamic instruction count will be minimized.
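As a sketch (assuming a and b are floats; the macro name MYMAX is just for illustration), the two forms would typically compile as follows: the built-in function to a single min/max instruction, and the discrete construct to a compare followed by a select or predicated move, both of which are branch-free:

// built-in device function (equivalent to the overloaded max() for floats):
// usually a single hardware instruction
float c = fmaxf(a, b);

// discrete source-level construct: compare + select, still non-divergent
#define MYMAX(x, y) (((y) < (x)) ? (x) : (y))
float d = MYMAX(a, b);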