An OpenACC kernel compiled with the flags “-std=c++11 -acc verystrict -Minfo=accel -O3 -g -ta=tesla:fastmath -c -fpic” took 30 ms to finish. The source code contained a single erfcf(x) call; after we changed it to “1.0f - erff(x)”, the kernel took only 22 ms to finish.
Sorry, I don’t know, since I’ve never looked at the performance of these intrinsics. We simply call the CUDA versions of these routines, so you may want to check whether there is any material on their performance in CUDA.
My question was about the performance of the float versions of ERFC vs. ERF, where ERFC is the complementary error function and is mathematically equivalent to (1 - ERF). I hope this clarifies my original post.
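One caveat on the equivalence: it holds mathematically, but not necessarily in single-precision arithmetic. The sketch below is plain host-side C++, not the OpenACC kernel from the post, and the helper names (erfc_direct, erfc_by_subtraction, relative_error) are made up for illustration. It assumes a standard C++11 math library and only shows how the two forms can be compared for accuracy; it says nothing about their speed on the GPU.

```cpp
#include <cassert>
#include <cmath>

// Direct single-precision complementary error function
// (std::erfc has a float overload in C++11).
float erfc_direct(float x) { return std::erfc(x); }

// The algebraically equivalent rewrite from the original post.
float erfc_by_subtraction(float x) { return 1.0f - std::erf(x); }

// Relative error of a float result against a double-precision reference.
double relative_error(float approx, double reference) {
    assert(reference != 0.0);  // a zero reference has no relative error
    return std::fabs(static_cast<double>(approx) - reference) /
           std::fabs(reference);
}
```

For moderate arguments such as x = 0.5f the two forms should agree closely. For large x, erff(x) rounds to 1.0f, so the subtraction returns 0 while erfcf(x) still returns a small positive value (for example at x = 8.0f). Whether that matters depends on the range of x the kernel actually sees.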