Page 51 of the 0.81 guide says that floating-point multiply-add instructions take 2 clock cycles.
But I couldn't find any multiply-add operator or function in the documentation.
With SSE intrinsics, for example, such operations are possible, but there dedicated functions are provided for them.
For CUDA, I couldn't find anything. The matrix multiplication sample, which performs a multiply followed by an add, just uses the ordinary multiply and add operators. Does nvcc automatically detect a multiply followed by an add and turn it into a multiply-add?
If so, how is that done?
And how do I get a multiply-add instruction myself?
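To make the question concrete, here is the kind of accumulation I mean. This is only a minimal plain-C sketch of the pattern, not the actual sample code: the function name `dot_row_col` and its parameters are made up for illustration, and the matrices are assumed to be square and stored row-major. In the sample the same kind of expression sits inside the kernel.

```c
/* Hypothetical helper (not from the SDK sample): computes one element of
   C = A * B for square n x n matrices stored in row-major order. */
static float dot_row_col(const float *a, const float *b, int n, int row, int col)
{
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        /* A multiply immediately followed by an add, written with the
           ordinary * and + operators. Is this what ends up as a single
           multiply-add instruction? */
        sum += a[row * n + k] * b[k * n + col];
    }
    return sum;
}
```

For example, with A = {1, 2, 3, 4} and B = {5, 6, 7, 8} as 2 x 2 matrices, `dot_row_col(a, b, 2, 0, 0)` computes 1*5 + 2*7.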
Thanks a lot