-cl-mad-enable: discussing its effects

I am working on some operations like

a + b * c + d * e

where all the variables are doubles. This is performed on very large data sizes.
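For reference, here is a minimal C sketch of the workload as described. It assumes the doubles live in five input arrays; the names a through e, out and combine_ref are hypothetical, since the original kernel is not shown. This is plain element-wise C, not OpenCL, and whether the compiler fuses the multiply-adds here depends on its floating-point contraction settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop: out[i] = a[i] + b[i]*c[i] + d[i]*e[i].
   Assumes the operands are stored as five separate arrays of length n. */
void combine_ref(size_t n, const double *a, const double *b, const double *c,
                 const double *d, const double *e, double *out) {
    assert(n == 0 || (a && b && c && d && e && out)); /* basic precondition */
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i] * c[i] + d[i] * e[i];
    }
}
```

If the data really is laid out like this, each element reads five doubles and writes one (48 bytes) for two multiplies and two adds, so a loop of this shape is more likely to be limited by memory bandwidth than by arithmetic throughput.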

For optimization, I rewrote the above as:


and compiled it with -cl-mad-enable. I was expecting a huge performance improvement, but the performance was exactly the same. Moreover, the loss of precision was far too large.

Though most of the available literature is full of praise for it, why hasn't it shown any improvement in my results?

Other combinations I have tried, which also did not help:

var = mad(b,c,a);
var = mad(d,e,var);

-cl-mad-enable should allow the compiler to use mad for the regular a * b + c notation; AFAIK, mad(b, c, a) should produce a mad in all cases.

From my tests, though, NVIDIA uses mad even if you don't specify -cl-mad-enable, since the precision is the same, at least for float (for a 32-bit multiply-add, the intermediate storage is 32 bits). CPUs use a higher intermediate precision, so their compilers would disable mad by default to make sure that you get consistent results.

I suspect a dependency on result latency: the second mad cannot start until the first one has produced var. Which GPU are you using for your tests?
I would try inserting at least one other computation between the two mads, or even interleaving two series of computations (the way Intel ispc does for SSE or AVX with its -x2 versions), to avoid register or result dependencies.