I don't see the -Ofast-compile option mentioned in the CUDA Toolkit 12.6 release notes. Is there anywhere I can find commentary on the design intent and any known effects or use cases?
How many settings (e.g. 0 1 2 3) does it offer, and which features are turned on or off at each setting?
Could you give any indication of what percentage speedup one might expect?
When compiling my PTX, however, it seems to make compilation slower rather than faster:
# without
time for i in {1..300}; do ptxas -arch=sm_60 my_code.ptx; done
real 0m6.026s
user 0m3.725s
sys 0m2.343s
# with -Ofast-compile
time for i in {1..300}; do ptxas -arch=sm_60 my_code.ptx -Ofc=max; done
real 0m6.450s
user 0m3.796s
sys 0m2.694s
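For anyone who wants to repeat the comparison, here is a rough sketch of how it could be scripted so that every variant goes through the same loop. The `bench` helper is hypothetical (my own name, not part of the toolkit); it assumes bash and GNU `date` with `%N` support, and the flag spellings are simply copied from the commands above. Note that 6.026 s over 300 runs is only about 20 ms per invocation, so differences of this size may partly reflect process startup and run-to-run noise rather than the optimizer itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command RUNS times and print the mean
# wall-clock seconds per run. Assumes GNU date (nanosecond %N support).
bench() {
    local runs=$1; shift
    local i start end
    start=$(date +%s%N)
    for ((i = 0; i < runs; i++)); do
        # Discard stdout; stop early if the command fails.
        "$@" > /dev/null || return 1
    done
    end=$(date +%s%N)
    # Convert elapsed nanoseconds to mean seconds per run.
    awk -v ns="$((end - start))" -v n="$runs" \
        'BEGIN { printf "%.4f\n", ns / 1e9 / n }'
}

# Usage: only runs when ptxas and the input file are actually present.
if command -v ptxas > /dev/null && [ -f my_code.ptx ]; then
    for opt in "" "-O0" "-O1" "-Ofc=max"; do
        # $opt is intentionally unquoted so the empty case adds no argument.
        printf '%-12s %s s/run\n' "${opt:-default}" \
            "$(bench 300 ptxas -arch=sm_60 $opt my_code.ptx)"
    done
fi
```

Running the whole loop a few times and comparing the spread between repetitions should show whether a 0.4 s gap over 300 runs is larger than the noise.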
That said, I’ve also found that -O0 compiles this same PTX more slowly than -O1, which makes little sense to me.
Any tips on making compilation run faster would be appreciated.