Different CUDA results when running on a different driver?

I recently installed the new 196.75 drivers from NVIDIA and tried rerunning some neural network code (written by someone else) built with the CUDA 2.3 SDK.

Strangely enough, after upgrading the drivers the code runs substantially faster, but much to my surprise the results are also substantially different. When training the network, I noticed the root mean squared error no longer settles as it did with the old drivers, and the classification rates when testing the network are much worse (although they weren’t very good to begin with).

Is it possible that the drivers can affect the actual values of the calculations the GPU performs? Or can I just not use the CUDA 2.3 SDK with these newer drivers?

The root mean squared error is not supposed to settle… it’s supposed to keep getting smaller. But if it keeps changing randomly, either the code is wrong or the learning rate is too high (so it keeps jumping out of local minima).
Maybe you should check how the learning rate and the steepness of the activation function are determined. Another thing to check is the random number generator.
I’m a noob to CUDA… I guess if anything device-dependent or driver-dependent has been used to determine some parameters, you would get different behaviour on different devices/drivers.

Well, I’m comparing these results to a backpropagation implementation on the CPU. In that version the RMSE reaches a lower value and stays relatively steady (with some small oscillations). Both the CUDA and CPU implementations use the same learning rate, number of iterations, epochs, etc.

So I was wondering if it’s plausible that a driver update could actually change the results of calculations done on the GPU?

Do you check for CUDA errors after each kernel launch? A new driver brings a new JIT compiler, which may allocate a different number of registers per thread than the old one. That could lead to some of your kernels failing to launch at all.
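
If the code doesn’t already do this, a minimal sketch of that kind of per-launch check looks something like the following (the macro name and the dummy kernel are just placeholders I made up; cudaGetLastError() and cudaGetErrorString() are the relevant runtime calls, and cudaThreadSynchronize() is the CUDA 2.3-era synchronization call):

```cpp
// Minimal per-launch error checking sketch. cudaGetLastError() is what
// catches launch failures such as "too many resources requested for launch",
// which is what you would see if the new JIT compiler needs more registers
// per thread than the launch configuration allows.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(msg)                                                   \
    do {                                                                  \
        cudaError_t err = cudaGetLastError();                             \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void dummyKernel(float *out) { out[threadIdx.x] = 1.0f; }

int main() {
    float *d_out;
    cudaMalloc((void **)&d_out, 256 * sizeof(float));

    dummyKernel<<<1, 256>>>(d_out);
    CHECK_CUDA("dummyKernel launch");      // catches launch-time failures

    cudaThreadSynchronize();               // CUDA 2.3-era sync; later renamed cudaDeviceSynchronize
    CHECK_CUDA("dummyKernel execution");   // catches errors that surface during execution

    cudaFree(d_out);
    return 0;
}
```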

Well, this brings me to another question: many of the sample programs in the CUDA SDK calculate an L2 error when comparing the GPU results to the CPU. If both use single-precision floating point, shouldn’t the error be zero when performing the same calculations?

This never works out for a number of reasons:

  • Floating point is not associative. The exact result depends on the order in which operations are performed, and a GPU algorithm is almost certain to do them in a different order than its CPU equivalent (see the sketch after this list).

  • It is actually very hard to get a CPU to work in pure single precision these days. If the compiler is using x87 floating point instructions, intermediate results are kept in 80-bit floating point registers, which are even more precise than single or double precision. There are compiler-specific flags to force truncation of the extra precision (by spilling the registers to memory between operations), but it is tricky.

  • Current CUDA devices do not exactly follow the IEEE-754 floating point standard for single precision operations. Appendix A explains the details, but in particular, division and square-root are not exactly compliant. Division can be as much as 2 ulps off, and square-root has a precision of 3 ulps. Double precision is fully IEEE-754 compliant on compute 1.3 devices, and with Fermi both single and double precision will follow the IEEE-754 standard.
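
To see the first point in practice, here is a small host-only sketch (purely illustrative, not taken from any real code) that sums the same single-precision values in two different orders: a sequential loop like a typical CPU implementation, and a pairwise tree reduction like a typical GPU reduction. The two sums usually differ in the low-order bits even though every individual operation is IEEE single precision:

```cpp
// Same 32-bit floats, two summation orders, (slightly) different results.
// A sequential CPU loop uses the first ordering; a parallel GPU reduction
// effectively uses the second.
#include <cstdio>

int main() {
    const int N = 1 << 20;                    // power of two, so the halving below stays even
    static float x[1 << 20];
    for (int i = 0; i < N; ++i)
        x[i] = 1.0f / (float)(i + 1);         // values of very different magnitude

    // Order 1: sequential left-to-right accumulation (typical CPU loop)
    float seq = 0.0f;
    for (int i = 0; i < N; ++i)
        seq += x[i];

    // Order 2: pairwise (tree) accumulation, the order a parallel reduction uses
    static float tmp[1 << 20];
    for (int i = 0; i < N; ++i) tmp[i] = x[i];
    for (int n = N; n > 1; n /= 2)
        for (int i = 0; i < n / 2; ++i)
            tmp[i] = tmp[2 * i] + tmp[2 * i + 1];

    // The two printed sums usually differ in the last few digits, even though
    // the inputs and the arithmetic are identical single-precision operations.
    printf("sequential: %.8f\npairwise:   %.8f\n", seq, tmp[0]);
    return 0;
}
```

On 32-bit x86 builds you also need something like gcc’s -mfpmath=sse to keep the sequential loop in true single precision instead of 80-bit x87 registers, which is exactly the second point above.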

Thanks for the explanation; it’s good to know that the newer cards will be fully IEEE-754 compliant.