I have a program that performs a random walk for a large number of test particles. All particle trajectories, i.e. all threads, are independent. Therefore, I do not have to use shared memory.
If I compile with --use_fast_math, the runtime of a typical simulation is ~160 minutes.
If I do not use the fast math flag, the runtime is much, much higher: about 1625 minutes.
That is a difference of slightly more than a factor of 10.
I am a little puzzled by this large difference. Unfortunately, I was not able to find any benchmarks that compare runs with and without fast math for real applications, only statements that fast math has “some” positive impact on the runtime.
When I compile without fast math, I get the following ptxas output:
ptxas info : 77707 bytes gmem, 96 bytes cmem
ptxas info : Compiling entry function '_Z11random_walkP17curandStateXORWOWP8particleffii' for 'sm_30'
ptxas info : Function properties for _Z11random_walkP17curandStateXORWOWP8particleffii
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 55 registers, 352 bytes cmem, 136 bytes cmem
With --use_fast_math, the output is:
ptxas info : 77707 bytes gmem, 72 bytes cmem
ptxas info : Compiling entry function '_Z11random_walkP17curandStateXORWOWP8particleffii' for 'sm_30'
ptxas info : Function properties for _Z11random_walkP17curandStateXORWOWP8particleffii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 47 registers, 352 bytes cmem, 52 bytes cmem
Thus, the build without fast math uses 8 more registers than the fast-math build. According to the occupancy calculator, the occupancy is 63% for 47 registers and 56% for 55 registers. Can this explain the significant difference in runtime (plus some contribution from the faster math functions themselves)?
I use a GTX 660, compile for compute capability 3.0, and use __launch_bounds__(128) for the kernel, since I always launch a fixed 128 threads per block.
Thank you very much,