Question on --use_fast_math speedup and register usage

Hello all,

I have a program that performs a random walk for a large number of test particles. All particle trajectories, i.e. all threads, are independent. Therefore, I do not have to use shared memory.

If I compile with --use_fast_math, the runtime of a typical simulation is ~160 minutes.
If I do not use the fast math flag, the runtime is much, much higher: about 1625 minutes.
That is a difference of more than a factor of 10.

I am a little puzzled by this large difference. Unfortunately, I was not able to find any benchmarks that compare fast math and no fast math for real applications, only statements that fast math has “some” positive impact on the runtime.

When I compile without fast math, I get the following ptxas info:

ptxas info    : 77707 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z11random_walkP17curandStateXORWOWP8particleffii' for 'sm_30'
ptxas info    : Function properties for _Z11random_walkP17curandStateXORWOWP8particleffii
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 55 registers, 352 bytes cmem[0], 136 bytes cmem[2]

Using --use_fast_math, the output is

ptxas info    : 77707 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_Z11random_walkP17curandStateXORWOWP8particleffii' for 'sm_30'
ptxas info    : Function properties for _Z11random_walkP17curandStateXORWOWP8particleffii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 47 registers, 352 bytes cmem[0], 52 bytes cmem[2]

Thus, the build without fast math uses 8 more registers than the build with fast math. According to the occupancy calculator, the occupancy is 63% for 47 registers and 56% for 55 registers (with 128-thread blocks on sm_30, 47 registers/thread allows 10 resident blocks, i.e. 1280 of 2048 threads = 62.5%, while 55 registers/thread allows only 9 blocks, i.e. 1152 threads = 56.25%). Can this explain the significant difference between the runtimes (plus some contribution from the fast math operations themselves)?

I use a GTX 660, compile for compute capability 3.0, and use __launch_bounds__(128) for the kernel, since I use a fixed block size of 128 threads.
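For reference, the kernel declaration looks roughly like this (the signature is reconstructed from the mangled name in the ptxas output above; the parameter names are my own):

#include <curand_kernel.h>

struct particle;  // my own struct holding position, energy, status, ...

// fixed block size of 128 threads, communicated to ptxas via launch bounds
__global__ void __launch_bounds__(128)
random_walk(curandStateXORWOW *state, particle *particles,
            float dt, float t_max, int n, int seed);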

Thank you very much,

Phillip

The change in occupancy you describe is unlikely to cause such a huge difference in runtime.

Can you say more about which math operations your kernel performs? Among other things, the fast math option switches from accurate software implementations of transcendental functions, like sin() and cos(), to less accurate implementations that use the special function units (SFUs) of the multiprocessor. (The CUDA C Programming Guide has an appendix that gives more detail on the accuracy loss. Generally speaking, the hardware implementation is most accurate for small arguments.) I am still a little surprised that this would make a factor of 10 difference, but I haven’t written the microbenchmark to check.

If the code is mainly composed of operations that are affected by -use_fast_math, the difference in performance could be very significant, but I am surprised to see a 10x difference. Without seeing the source code and comparing the machine code produced with and without -use_fast_math it is not possible to give a diagnosis as to how that massive speedup comes about. I suspect a combination of direct speed-up from faster approximate operations, additional compiler optimizations being enabled by the simplified code, and improved occupancy.

I assume this is on a GPU with compute capability >= 2.0, and that the computation is all single-precision? For architectures >= sm_20, -use_fast_math implies -ftz=true -prec-div=false -prec-sqrt=false. That is, denormal support is disabled for single-precision computation, and single-precision square root, division, and reciprocal are approximate rather than rounded according to IEEE-754. In addition, various single-precision transcendental functions are mapped to less accurate, HW-accelerated versions (see the Programming Guide for details).
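Note that these fast paths can also be selected per call site in the source code, without the global compiler flag; a minimal sketch:

__device__ float demo(float x, float y)
{
    // With -use_fast_math these mappings happen automatically; without the
    // flag, the device intrinsics can be used explicitly wherever the
    // reduced accuracy is acceptable:
    float s = __sinf(x);         // HW-accelerated, less accurate than sinf()
    float d = __fdividef(x, y);  // approximate division (inaccurate for very large y)
    return s + d;
}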

The error of the hardware-accelerated transcendental functions is difficult to characterize in the usual metrics for floating-point arithmetic, such as ulp error or relative error. See the table in the back of the Programming Guide for information on the error bounds. The hardware acceleration is based on quadratic interpolation in tables using fixed-point arithmetic, so the relative error can be very large for some inputs, for example for __sinf() near 0 and __logf() near 1.
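A quick probe makes this visible (a sketch; compile without -use_fast_math so that the double-precision reference functions stay accurate):

#include <cstdio>

__global__ void err_probe(void)
{
    float y = ldexpf(1.0f, -20);         // tiny argument for sin
    float x = 1.0f + ldexpf(1.0f, -20);  // argument slightly above 1 for log
    printf("__sinf rel. error near 0: %g\n",
           fabs((double)__sinf(y) - sin((double)y)) / sin((double)y));
    printf("__logf rel. error near 1: %g\n",
           fabs((double)__logf(x) - log((double)x)) / log((double)x));
}

int main(void)
{
    err_probe<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}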

Does the kernel implement an iterative algorithm where the number of iterations could be strongly influenced by the accuracy of the computation? If so, you might want to compare how many iterations are performed with and without -use_fast_math, and whether the quality of the results of both versions is sufficiently close. If there is a significant difference in the number of iterations, that could explain the drastic difference in run time.
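One cheap way to check this is a global step counter that is compared between the two builds; a minimal sketch (the loop is only a stand-in for the actual walk loop):

__device__ unsigned long long g_steps = 0;  // read back with cudaMemcpyFromSymbol()

__global__ void count_steps(float t_max, float dt)
{
    unsigned int local = 0;
    for (float t = 0.0f; t < t_max; t += dt)  // stand-in for the walk loop
        ++local;
    atomicAdd(&g_steps, (unsigned long long)local);
}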

The code I used for the problem stated above can be seen on Stack Overflow; njuffa already helped me to understand another issue there. [url]http://stackoverflow.com/questions/19034321/cuda-double-demoted-to-float-and-understanding-ptx-output[/url]
As you can see, this (unoptimized) code performs a lot of trigonometric operations.

Let me briefly explain what the kernel function does: I inject a number of particles into a sphere and let them perform a random walk in three dimensions (spherical coordinates) according to position-dependent diffusion coefficients, plus a deterministic change of the particles’ energy as a function of time. There are three absorbing boundaries: an inner one, an outer one, and an absorbing boundary inside the sphere. If a particle hits one of these boundaries, it no longer propagates. Consequently, the number of “free” particles is not constant. For the integration of the trajectories I define a time step (dt, like 1e-3 or 1e-4) and a maximal time. When the simulation reaches this time, the kernel returns, and the last position of all particles (or the position where they hit a boundary) and the corresponding time step are written to a file.

In the meantime I have done some further experiments and also written an optimized version of my kernel function, where I make use of sincosf() and maximize the reuse of already computed data.
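For example, wherever both the sine and the cosine of the same angle are needed (as in the spherical-coordinate updates), the two calls can be fused into one; a sketch, using the usual physics convention, which is not necessarily the one in my code:

__device__ void to_cartesian(float r, float theta, float phi,
                             float *x, float *y, float *z)
{
    float st, ct, sp, cp;
    sincosf(theta, &st, &ct);  // one call instead of sinf() + cosf()
    sincosf(phi, &sp, &cp);
    *x = r * st * cp;
    *y = r * st * sp;
    *z = r * ct;
}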
The runtimes with and without fast math for both versions of the program can be summarized as follows (t_end in program time units):

For the old code:

  particles   dt      t_end   with fast math   without     ratio
  1e6         1e-4    2.0     17.56 s          98.7 s      5.61x
  1e6         1e-3    2.0     9.7 s            17.6 s      1.8x
  1e6         1e-4    5.0     30.8 s           229.5 s     7.4x
  2e6         1e-4    5.0     65.2 s           464.3 s     7.1x

For the new code:

  particles   dt      t_end   with fast math   without     ratio
  1e6         1e-4    2.0     16.2 s           63.4 s      3.9x
  1e6         1e-3    2.0     9.5 s            14.3 s      1.5x
  1e6         1e-4    5.0     27.6 s           144.2 s     5.2x
  2e6         1e-4    5.0     57.71 s          288.3 s     5.2x

From this I conclude that the runtime scales linearly with the number of particles, as one would expect.
It can be seen, however, that the ratio between the runtimes with and without fast math depends on the time step used for the integration of the trajectories as well as on the end time: for dt=1e-4 the ratio is much larger than for dt=1e-3, and for t_end=5 the ratio is larger than for t_end=2.

I would expect some non-linearities in the runtimes because of the nature of the algorithm, but I do not understand the massive difference for fast math.

The final results (statistical properties) of the particles are in good agreement with and without fast math, with deviations normally not larger than 1%, while individual trajectories may be quite different. For the example with 2e6 particles, 1208 particles show different arrival times at the boundary. I cannot imagine that these differences in the trajectories could have such a significant impact on performance.

These speedups look more plausible than the originally stated “> 10x”, although they are still on the high side of what I would expect. Checking the throughput of some of the operations present in your code on Kepler, I see a throughput ratio of about 3x for __sincosf() vs sincosf(), 3.5x between div.approx.ftz.f32 and div.rn.f32, and 2.0x between sqrt.approx.ftz.f32 and sqrt.rn.f32. So for code dominated by these operations a performance difference of 2x-3x seems feasible based on that data.
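A simple way to reproduce such numbers is to time a kernel that consists of a long dependent chain of the operation in question, built once with and once without -use_fast_math (a sketch; timing boilerplate omitted):

__global__ void sin_chain(float *out, float x, int iters)
{
    float acc = x;
    for (int i = 0; i < iters; ++i)
        acc = sinf(acc) + 1.0f;  // compiles to __sinf() under -use_fast_math
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;  // defeat dead-code elimination
}

With enough blocks in flight, the aggregate rate approximates the throughput of the operation rather than its latency.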

The IEEE-rounded division and square root are called as subroutines, while the approximate versions are inlined. This gives the compiler much greater freedom during instruction scheduling. There may be other optimizations, such as common subexpression elimination, that benefit from the inlining. The reduction in register pressure, which results in increased occupancy, could also improve performance.

Hello,

note that for the runtime comparison in my last post I used a maximal time of 5.0, whereas in the original post I used 10.0. It seems that the speedup increases with increasing maximal time.

Indeed, a factor of <3 is what I would have expected for a code without shared memory and with independent threads but many calls to math functions.

As mentioned above, it can happen that at some point a number of threads simply do nothing because they have already hit a boundary. Could this have an impact on the runtime behaviour, e.g. on the maximal number of blocks that run on an SM, or change some other parameter to the benefit of fast math?
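For reference, the per-particle loop of my kernel has roughly this structure (a sketch):

__global__ void walk_sketch(float t_max, float dt)
{
    bool absorbed = false;
    for (float t = 0.0f; t < t_max && !absorbed; t += dt) {
        // propagate the particle; set absorbed = true when a boundary is hit
    }
    // An absorbed particle leaves the loop early, but a warp only retires
    // when all of its threads are done, and the block's registers are only
    // released when the entire block has exited.
}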

Btw: the CPU implementation (i7, single thread) of the same algorithm using the GSL random number generator is about 210x slower than the CUDA implementation (no fast math) for a large number of particles (a million), which sounds reasonable considering the high degree of parallelism. However, a fast math speedup of up to 10 still seems very strange to me (although the CUDA showcase gives some examples of code with a performance increase of three orders of magnitude).

I would think the most important question is whether the code version compiled with -use_fast_math delivers correct results. Whenever there is an unexpected large speedup, it is worthwhile to check that all kernels actually ran, algorithms executed the expected number of iterations, the relative errors (or residuals) for the results are in the expected range, and so on. I do not readily see why the speedup to be had from -use_fast_math would be increasing with the runtime, unless a longer runtime also entails additional configuration changes.

CPU / GPU performance comparisons claiming performance differences of three orders of magnitude warrant heightened scrutiny.

Yes, this is one of the most important questions. I have started to compare the results of the CPU and CUDA versions (with and without fast math) for several random number seeds with respect to the statistical properties, to get an idea what deviations are typical and thus acceptable for the problem.

However, I now think that I may use fast math only for quick-look results: the distribution of the residuals is not Gaussian, but somewhat biased.

Thank you very much for your suggestions.

I would caution against using CPU / GPU comparisons as the basis for establishing correctness or accuracy. Both computations will have numerical error. As the saying goes: “A man with one watch knows what time it is. A man with two watches can never be sure.” Which results are closer to the correct result can in most cases be established by using a high-precision reference (I use double-double arithmetic for much of my own work).

If the distributions are off, you may also want to look carefully at the random number generation. Is it initialized properly, so that all threads use independent streams of random numbers? For long-running simulations that consume massive amounts of random numbers, you may also want to try a different PRNG.
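Trying, for example, CURAND’s MRG32k3a generator is mostly a type change in the setup code (a sketch; the kernel and array names are placeholders):

#include <curand_kernel.h>

__global__ void setup_mrg(curandStateMRG32k3a *state, unsigned long long seed)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, id, 0, &state[id]);  // same calling convention as for XORWOW
}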

Because I use the XORWOW generator on the GPU and the GSL Mersenne Twister on the CPU, I cannot reproduce the same trajectories. Therefore I compare the statistical properties of the results, like the mean arrival time of the particles at a boundary. I did this for a number of different seeds, and the average arrival time and the standard deviation are in good agreement. However, I haven’t compared the fast math results yet.

To initialize the PRNG state for my particles (<2000000), each corresponding thread’s state is initialized using

int id = threadIdx.x + blockIdx.x * blockDim.x;
curand_init(seed, id, 0, &state[id]);  // sequence number = id, offset = 0

Each thread draws no more than ~500000 random numbers in a long run (in most cases much less). I think this does not result in any overlap, because 500000 is much smaller than 2^67, i.e. the spacing between the subsequences of consecutive threads.

Are the results any different when you use CURAND’s Mersenne Twister?

Good remark. I tried to use the MT some weeks ago but had some trouble with the implementation. Because the maximal number of blocks for the MT is 200, I tried the grid-stride loop

int tid = threadIdx.x + blockIdx.x * blockDim.x;
while (tid < n) {
    /* code */
    tid += blockDim.x * gridDim.x;
}

shown in the “CUDA by Example” book for handling arbitrarily long vectors with a fixed number of blocks per grid. However, the result was that the random sequence of every thread repeated after only four numbers. If I remember correctly, a simple test program gave correct results.
I will reinvestigate my MT code in the near future to find the error.
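For reference, my reading of the cuRAND documentation is that MTGP32 keeps one state per block (at most 200 states, up to 256 threads per block), set up from the host. A sketch based on the manual’s example (link with -lcurand; names are mine):

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>

__global__ void mtgp_demo(curandStateMtgp32 *state, float *out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // One shared state per block, indexed by blockIdx.x, NOT by thread id:
    out[tid] = curand_uniform(&state[blockIdx.x]);
}

int main(void)
{
    const int nblocks = 200;  // hard upper limit for MTGP32 states
    curandStateMtgp32 *states;
    mtgp32_kernel_params *params;
    float *out;
    cudaMalloc(&states, nblocks * sizeof(curandStateMtgp32));
    cudaMalloc(&params, sizeof(mtgp32_kernel_params));
    cudaMalloc(&out, nblocks * 256 * sizeof(float));
    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, params);
    curandMakeMTGP32KernelState(states, mtgp32dc_params_fast_11213,
                                params, nblocks, 1234ULL);
    mtgp_demo<<<nblocks, 256>>>(states, out);
    cudaDeviceSynchronize();
    return 0;
}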

As mentioned above, CUDA code with XORWOW seems to be in good agreement with the CPU code.

The use of the XORWOW algorithm (which is much faster than the Mersenne Twister) also probably contributes to the large speed difference between the GPU and CPU implementations.