Generated PTX file is twice as big as the one from the OptiX 7 SDK sample

Building OptixPathTracer from the OptiX SDK samples generates an 18 KB PTX file, and the exe runs on a GTX 1060 at ~5 fps,
while building the same code from scratch generates a 39 KB PTX file and runs slower, at ~4 fps.

Win 10.0.17763.0, CUDA 10.1, OptiX 7.0.0, VS Community 2019

I’m probably missing something but can’t figure out what. Here is the procedure I followed:

  1. Created a new CUDA Project in VS2019
  2. Hand-copied the project configuration from the sample in every category; only the include and lib paths differ, but they point to the same SDK
  3. Created a new CUDA source file and copied in the content of OptixPathTracer.cu
  4. Copy/Pasted the sample header and cpp files inside the new project
  5. Changed the Project Configuration for CUDA C/C++ (-maxrregcount=0 --machine 64 -ptx -cudart static)
  6. Superstition number, decided to have a coffee.
  7. The project built successfully in Release mode, but the generated .ptx file has a lot more instructions than the one from a fresh build of the OptiX SDK samples.

After revisiting the project from the SDK, I see that it doesn’t have CUDA as a build dependency; instead it uses CMakeLists, which I’m not very familiar with but which is obviously the only difference between the two projects. This is also my first experience with both CUDA and OptiX, so there is probably a better way to create a new project using the two that I’m not aware of. For now I’d be happy if I could at least build the optimized .ptx file.

The question is: which nvcc flags am I missing?

If you diffed the *.ptx sources and found that the smaller file contains “approx” instructions for the trigonometric functions and square roots, and yours doesn’t, then you’re missing the --use_fast_math option.
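As a rough way to run that check without reading a full diff, the sketch below counts PTX instructions that carry the `.approx` modifier. The function name `count_approx` and the two sample strings are hypothetical: the strings are hand-written illustrative lines, not output from either build, and you would pass in the text of your own .ptx files instead.

```python
from collections import Counter


def count_approx(ptx_text):
    """Count instructions with the .approx modifier, keyed by base opcode."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        # Drop trailing // comments, then split the line into tokens.
        tokens = raw.split("//", 1)[0].split()
        # Skip a leading predicate guard such as "@%p1" or "@!%p2".
        if tokens and tokens[0].startswith("@"):
            tokens = tokens[1:]
        if not tokens:
            continue
        parts = tokens[0].split(".")  # e.g. ["sqrt", "approx", "ftz", "f32"]
        if "approx" in parts[1:]:
            counts[parts[0]] += 1
    return counts


# Hand-written sample lines (assumption: for illustration only).
sample_fast = """
    sin.approx.ftz.f32 %f2, %f1;
    sqrt.approx.ftz.f32 %f3, %f2;
    @%p1 rcp.approx.ftz.f32 %f4, %f3;
"""
sample_precise = """
    sqrt.rn.f32 %f3, %f2;
    div.rn.f32 %f4, %f3, %f1;
"""

print(dict(count_approx(sample_fast)))
print(dict(count_approx(sample_precise)))
```

If the SDK-built file shows many approx instructions and yours shows none, that points at the fast math option rather than at the source code.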

You also might not have used the same streaming multiprocessor target, and some targets generate additional spurious runtime functions (which aren’t used).
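The target a PTX file was built for is recorded in its header directives, so the two files can be compared directly. This is a minimal sketch; `ptx_header` is a hypothetical helper, and the two header strings are made-up examples rather than the headers of the actual files discussed here.

```python
import re


def ptx_header(ptx_text):
    """Return the .version, .target and .address_size directives found in PTX text."""
    info = {}
    for key in ("version", "target", "address_size"):
        match = re.search(rf"^\s*\.{key}\s+(\S.*?)\s*$", ptx_text, re.MULTILINE)
        if match:
            info[key] = match.group(1)
    return info


# Made-up headers (assumption: for illustration only).
header_a = ".version 6.4\n.target sm_60\n.address_size 64\n"
header_b = ".version 6.4\n.target sm_30\n.address_size 64\n"

a, b = ptx_header(header_a), ptx_header(header_b)
print(a["target"], b["target"])
print("same target:", a["target"] == b["target"])
```

A mismatch in the `.target` line means the two builds were given different architecture options to nvcc.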

Check your NVCC options for these things as well:
https://devtalk.nvidia.com/default/topic/1052566/optix/assertion-failed-quot-acp-gt-isusedassinglesemantictype-quot-/post/5343482/#5343482

Indeed, this was the cause. In the property pages under CUDA C/C++, the fast math option is available only on the Host tab, where it was already turned on; the Device tab has no such option. I added --use_fast_math to the Command Line additional options and now the .ptx is optimized. Thanks!