Why are FFMA instructions still there even if --fmad=false is set?

Hi, all.

I’m writing kernels for PyTorch and I want to disable fused multiply-add. I added `--fmad=false` to the nvcc flags, following the official tutorial. However, when I checked the SASS of the generated .pyd with cuobjdump to make sure everything was correct, I found that only some of the FFMA instructions had been replaced.

That’s weird. Did I do something wrong, or is that the expected behavior?
I’m using Windows 10, PyTorch 1.7 + CUDA 11.0, with gencode=arch=compute_61,code=sm_61.

-fmad=false prevents the compiler from contracting an FMUL and a dependent FADD into an FMA. Calls to the standard math functions fma() and fmaf() will result in FMA instructions (FFMA, DFMA) being emitted regardless of the setting of this switch. Such calls can occur inside inlined standard math functions, for example.
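
To make the distinction concrete, here is a minimal sketch (not taken from the original kernels; the kernel names and the `run_kernels` helper are made up for illustration). It puts a contractible `a * b + c`, an explicit `fmaf()` call, and the `__fmul_rn()` / `__fadd_rn()` intrinsics side by side. The intrinsics are documented as never being merged into a single multiply-add, so they are one way to keep a specific expression unfused independent of the switch. The inputs are chosen so that the fused and unfused results differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain multiply followed by a dependent add. The compiler may contract
// this into a single FFMA unless --fmad=false is passed to nvcc.
__global__ void contractible(float a, float b, float c, float *out) {
    out[0] = a * b + c;
}

// Explicit call to fmaf(): this is emitted as FFMA regardless of --fmad.
__global__ void explicit_fma(float a, float b, float c, float *out) {
    out[1] = fmaf(a, b, c);
}

// Round-to-nearest intrinsics: never merged into a fused multiply-add,
// with or without --fmad=false.
__global__ void never_fused(float a, float b, float c, float *out) {
    out[2] = __fadd_rn(__fmul_rn(a, b), c);
}

// Hypothetical helper (illustration only): runs the three kernels and
// copies the three results back to the host. Returns false on a CUDA error.
bool run_kernels(float host[3]) {
    float *dev = nullptr;
    if (cudaMalloc(&dev, 3 * sizeof(float)) != cudaSuccess) return false;

    // a * a = 1 + 2^-11 + 2^-24 exactly; rounding the product to float
    // drops the 2^-24 term, so the unfused sum with c is 0 while the
    // fused result keeps 2^-24.
    const float a = 1.0f + 1.0f / 4096.0f;     // 1 + 2^-12
    const float c = -(1.0f + 1.0f / 2048.0f);  // -(1 + 2^-11)

    contractible<<<1, 1>>>(a, a, c, dev);
    explicit_fma<<<1, 1>>>(a, a, c, dev);
    never_fused<<<1, 1>>>(a, a, c, dev);

    cudaError_t err = cudaMemcpy(host, dev, 3 * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}

int main() {
    float h[3];
    if (!run_kernels(h)) {
        std::printf("CUDA error\n");
        return 1;
    }
    // The first value depends on whether --fmad=false was used.
    std::printf("a*b+c            : %g\n", h[0]);
    std::printf("fmaf(a,b,c)      : %g\n", h[1]);
    std::printf("__fadd_rn/__fmul : %g\n", h[2]);
    return 0;
}
```

Assuming the file is saved as demo.cu (a placeholder name), it can be built with `nvcc -arch=sm_61 --fmad=false demo.cu -o demo` and inspected with `cuobjdump -sass demo`. With the switch set, an FFMA should still show up in `explicit_fma`, while `contractible` and `never_fused` should use separate FMUL and FADD instructions.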