I have some CUDA Fortran code that makes heavy use of trigonometric functions on complex numbers.
The code produces correct output when I compile without optimization flags, i.e., with only -Mcuda=maxregcount:32,cc30,cuda5.5.
However, as soon as I turn on optimization (e.g., -O2), the results are corrupted. I have had trouble tracking down the problem, but I suspect that optimization breaks some of the device intrinsics I am using.
Any suggestions as to why this could happen? I thought that optimization only affects the CPU part of the code, not the device intrinsics.
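
One way to test this suspicion would be a small standalone program that evaluates the same complex intrinsics on the host and on the device and prints the largest difference. The following is only a sketch: the names `trig_check` and `trig_kernel`, the choice of `sin` and `exp`, and the input values are made up for illustration and are not taken from the actual code. Building it once with and once without -O2 (keeping the same -Mcuda options) should show whether the device intrinsics themselves change behavior.

```fortran
module trig_check
  use cudafor
  implicit none
contains
  ! Hypothetical kernel: applies complex sin and exp to each element on the device.
  attributes(global) subroutine trig_kernel(z, s, e, n)
    integer, value :: n
    complex :: z(n), s(n), e(n)
    integer :: i
    ! Global 1-based index of this thread.
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      s(i) = sin(z(i))
      e(i) = exp(z(i))
    end if
  end subroutine trig_kernel
end module trig_check

program compare
  use cudafor
  use trig_check
  implicit none
  integer, parameter :: n = 1024
  complex :: z(n), s_host(n), e_host(n), s_dev(n), e_dev(n)
  complex, device :: z_d(n), s_d(n), e_d(n)
  integer :: i

  ! Illustrative inputs with modest real and imaginary parts.
  do i = 1, n
    z(i) = cmplx(0.001 * i, -0.001 * i)
  end do

  ! Reference values computed on the host.
  s_host = sin(z)
  e_host = exp(z)

  ! Same computation on the device; array assignment handles the transfers.
  z_d = z
  call trig_kernel<<<(n + 255) / 256, 256>>>(z_d, s_d, e_d, n)
  s_dev = s_d
  e_dev = e_d

  ! Small differences are expected from rounding; large ones point at the device path.
  print *, 'max |sin diff| =', maxval(abs(s_dev - s_host))
  print *, 'max |exp diff| =', maxval(abs(e_dev - e_host))
end program compare
```

If the differences stay small at both optimization levels, the intrinsics are probably not at fault, and the problem is more likely in how the surrounding host or kernel code is optimized.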