FMADD non-deterministic?

Hey, just a quick question for one of the nvidia guys…

Am I right to assume your optimized fmadd instruction (which isn’t full precision) is also non-deterministic on the same device?
(e.g. an fmadd with the same inputs could generate different outputs, on the same device?)

The reason I ask is that we have an internal test case which validates the determinism of our algorithms. Our CUDA accelerated ones algorithmically ‘should’ be deterministic, but aren’t, and the two prime candidates are either CUDA or OpenGL. We’ve all but ruled out OpenGL (in fact the CPU based implementations also use OpenGL, and are still deterministic), which leaves CUDA (and I’ve seen reports on these forums relating to fmadd and deterministic behaviour before).

Cheers,

While I haven’t attempted to determine whether FMADD, specifically, is deterministic, I have done tests using HOOMD. Given the same initial conditions run several times (and on several different GPUs, even), I get exactly identical binary output for the position of every particle through time. And there certainly must be a few FMADDs being generated in the code. Also note that molecular dynamics is chaotic, so even a 1-bit difference in the least significant bit at one point in time would magnify into a very large difference over a long simulation.
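If you ever want to rule FMAD itself in or out, one thing you could try (just a rough sketch - the kernel and names below are made up) is forcing the compiler to keep the multiply and add separate with the single-rounding intrinsics, then diffing that against the fused version:

```cpp
// Illustrative only: __fmul_rn() and __fadd_rn() are never contracted into
// an FMAD by the compiler, so comparing this kernel's output against the
// plain "a[i] * b[i] + c[i]" version isolates any difference FMAD introduces.
__global__ void madd_unfused(const float* a, const float* b,
                             const float* c, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);
}
```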

Is the device returning different results on different runs of the same executable, or is it simply different from the CPU version?

CPUs often carry extra bits in the floating-point registers, making intermediate values more exact than ordinary floating point would be. These additional bits are not stored in memory, which means slightly different outputs can result from the compiler’s whims as to which registers spill to memory and when. For this reason, Debug and Release builds very often differ on the CPU, but the effect can cause differences even with the same compiler settings after a very minor code change. The bottom line is that CPU floating point is “nondeterministic” unless extra measures are taken to force determinism (at the expense of speed, of course).

I do not know whether the GPUs carry extra bits which would be lost when stored and re-loaded from memory, but my guess is that they don’t.
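To make that register-spill effect concrete, here is the classic sort of toy case (just a sketch - whether you actually see two different answers depends entirely on the compiler, the flags, and whether it is using x87 or SSE math):

```cpp
#include <cstdio>

int main()
{
    double a = 1.0e16, b = 1.0, c = -1.0e16;

    // Kept in an 80-bit x87 register, (a + b) can retain the "+1"...
    double in_register = (a + b) + c;

    // ...whereas forcing the intermediate out to a 64-bit memory slot
    // rounds it first, so the "+1" is lost before c is added.
    volatile double spilled = a + b;
    double via_memory = spilled + c;

    printf("in_register = %f\n", in_register);  // may print 1.000000 on x87
    printf("via_memory  = %f\n", via_memory);   // prints 0.000000
    return 0;
}
```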

Are you doing a reduction in your algorithm?
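If so, that would be my first suspect. Even a bug-free reduction can differ in the last bit from run to run if the combine order isn’t fixed; a minimal sketch of what I mean (assuming a card that supports atomicAdd on float - the names are illustrative):

```cpp
// The atomics themselves are correct, but the ORDER in which blocks commit
// their contributions depends on scheduling, and float addition is not
// associative, so the low bits of *result can change between identical runs.
__global__ void sum_with_atomics(const float* in, int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);
}
```

A tree reduction with a fixed combine order doesn’t have that problem.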

How are you comparing the determinism of the algorithms?

Also, if you’re not getting the EXACT same results (but close) between the CPU and GPU code…that is to be expected due to the differences in the algorithms (and the floating point round-off errors inherent in the sequential/parallel algorithms).

We’ve never seen any evidence of any math operation being nondeterministic, with hundreds of machines running regression suites every night (for a few years now). There are a host of reasons why it wouldn’t equal a CPU implementation exactly, but there’s no reason why values would differ between runs if your kernel is implemented correctly (no race conditions or use of uninitialized memory, for example).
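For what it’s worth, this is the sort of thing I mean by a race condition - a contrived example (made-up names, assumes 256 threads per block):

```cpp
// Omitting the barrier lets some threads read tile[] before their
// neighbours have written it; the outcome depends on warp scheduling,
// so the output genuinely can vary from run to run.
__global__ void blur_with_race(const float* in, float* out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];
    // __syncthreads();   // <-- leaving this out is the bug
    float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
    out[i] = 0.5f * (left + tile[threadIdx.x]);
}
```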

If you’re using a consumer board, though, you might want to check that it’s not overclocked. Our QA labs use cards with the standard clocks, and some people on these boards have reported CUDA problems on overclocked cards.

This isn’t an issue of it being different from the CPU implementation (that much is all but guaranteed); the problem is that we’re getting slightly different results for the same inputs somewhere in our CUDA version of that algorithm (I haven’t yet had the time to whittle my way down to figure out ‘exactly’ what’s coming out wrong). I just wanted to check whether fmadd was potentially an issue or not.

I’m guessing it’s something to do with OpenGL in that case (our CUDA accelerated implementation also uses hardware accelerated OpenGL, as opposed to Mesa w/ our CPU implementation) - which is the only other change made to this implementation.

Nothing is manually overclocked, nor did anything come factory overclocked (as some GPUs do).

Thanks for the boost of confidence though, 2+ years of regression tests on hundreds of machines is encouraging.

We run the algorithm over a video stream multiple times, taking checksums of all core variables for each frame - and compare the checksums between each run of the video stream. Thus if any variable is off by even a single bit, the checksum will be radically different.
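(Roughly speaking, the checksum just folds the raw bit pattern of each variable into a running hash, along these lines - a simplified sketch, not the real code:)

```cpp
#include <cstdint>
#include <cstring>

// Fold the raw bit pattern of a per-frame variable into a running hash,
// so even a one-ULP difference changes the checksum.
uint32_t fold_float(uint32_t checksum, float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // reinterpret, don't convert
    return (checksum ^ bits) * 16777619u;      // simple FNV-1a-style mix
}
```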

Our CUDA accelerated implementation differs from run to run specifically around the CUDA/OpenGL related code segments (CUDA calculates various variables which are used later on in the OpenGL rendering: texture coordinates, transformation matrices, etc.).


Well then you’re definitely going to see different results every time. Using DirectX and OpenGL with the exact same resolution, shader, mesh, and transformation matrices won’t even give you the exact same image, due to very fine subtleties in the graphics driver. There are small differences in things like fragment center coordinates that can affect the final output of the rasterization. If you’re trying to get a pixel-perfect reproduction using hardware accelerated OpenGL vs. software rendered Mesa, you’re basically on a fool’s errand.

Just to give you one quick example that I know with absolute certainty: Mesa does perspective correction every fourth pixel when rasterizing (without hardware interpolation units this saves costly reciprocal calculations at each pixel and allows linear interpolation of vertex attributes at most pixels), whereas the GPU will do this at every pixel. You MIGHT be able to reduce the differences by fine tuning rendering settings (for instance, if Mesa has an equivalent of the high quality perspective correction hint), but I very seriously doubt you’ll eliminate all of them.

If you want to ensure the results are the same I suggest comparing the absolute differences in your floating point input data (your transformation matrices and the like) rather than checksumming or comparing the output video.
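Something along these lines, rather than a bit-exact checksum (illustrative only):

```cpp
#include <cmath>
#include <cstddef>

// Compare two float buffers (e.g. two 4x4 transformation matrices)
// element-wise against a tolerance instead of demanding bit-exact equality.
bool nearly_equal(const float* a, const float* b, std::size_t n, float tol)
{
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(a[i] - b[i]) > tol)
            return false;
    return true;
}
```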

Our unit tests say otherwise, as does my experience with OpenGL. It’s naturally expected that we get different results with different cards/drivers, but the resulting rasterized image should be identical given the exact same inputs, on the exact same card, with the exact same state setup. Our unit tests prove this to be the case for a wide variety of renders (using a wide variety of OpenGL features & extensions), passing on two build servers with different cards (one NVIDIA, the other AMD) - or at least that’s proven to be the case for the past few years.

Edit: I should note we don’t use shaders (at all) in our codebase, so I can’t comment on that - but again, with the same card/driver/state settings, I would expect the output to be deterministic unless the shader was using random functions, time-based variables, etc…

Well yeah, for one API. OpenGL, like DirectX, provides OEMs with a reference specification which they must follow precisely so that the API behaves consistently on all hardware. What I mean is: good luck getting the exact same pixel-for-pixel image using 2 different APIs like DirectX vs. OpenGL, or Mesa vs. OpenGL, however similar they may be in Mesa’s case.