Yikes, bad computation results with CUDA 7.5 release driver

cbuchner1 · September 30, 2015, 2:41pm

Hi,

Out of curiosity, I installed CUDA 7.5 on a GTX 970-equipped Ubuntu 12.04 system (I know, unsupported configuration…). I tested some of the included CUDA samples, and these ran fine.

Interestingly one of our production kernels gave bad computation results after the driver update. The code was compiled for sm_30 with the CUDA 5.0 toolkit, and hence I suspect a problem with the runtime translation onto my GTX 970 GPU done by the driver.
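To make the suspicion about runtime translation concrete, here is a minimal Python sketch of the compatibility rule as I understand it from the CUDA documentation: SASS runs natively only on a device with the same major compute capability and an equal or higher minor revision, and otherwise the driver JIT-compiles embedded PTX if the PTX targets a capability no newer than the device. The function names `runs_native_sass` and `load_path` are hypothetical helpers for illustration, not part of any CUDA API, and the rule is simplified.

```python
# Simplified model of how the driver picks code from a fat binary.
# Compute capabilities are (major, minor) tuples, e.g. (5, 2) for a GTX 970.
# Function names are hypothetical; this is not a CUDA API.

def runs_native_sass(device_cc, sass_ccs):
    """True if any embedded SASS image is binary-compatible with the device."""
    dev_major, dev_minor = device_cc
    return any(
        major == dev_major and minor <= dev_minor
        for major, minor in sass_ccs
    )


def load_path(device_cc, sass_ccs, ptx_ccs):
    """Return which path the driver would take under this simplified rule."""
    if runs_native_sass(device_cc, sass_ccs):
        return "native SASS"
    # PTX can be JIT-compiled for any device at least as new as its target.
    if any(ptx_cc <= device_cc for ptx_cc in ptx_ccs):
        return "JIT from PTX"
    return "no compatible code"


if __name__ == "__main__":
    # A binary built for sm_30 (with compute_30 PTX) on a GTX 970 (sm_52).
    print(load_path((5, 2), sass_ccs=[(3, 0)], ptx_ccs=[(3, 0)]))
```

Under these assumptions, an sm_30 build on a compute capability 5.2 device falls through to the PTX path, which is the situation described in this post.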

Has anyone else observed known good kernels producing bad output after upgrading from an older driver (e.g. CUDA 7.0) to the driver shipping with the CUDA 7.5 release?

I downgraded the CUDA driver to the one included in the CUDA 7.0 toolkit and all is fine again.

Christian

njuffa · September 30, 2015, 3:25pm

You would probably want to report this to NVIDIA as a bug, although it seems possible that your original source code contained a latent bug that now has been exposed by the more aggressively optimizing compiler backend (ptxas) in the new driver.

The potential for cross-architecture JIT compilation to be broken is somewhat greater than for offline compilation. The offline compiler has to deal with exactly one combination of frontend and backend that is carefully tested. For JIT compilation there are numerous such combinations: many older frontends producing multiple different older versions of PTX that are subsequently compiled with one recent backend. That is a much bigger test space.

It is possible the new backend in the driver has a bug, but it is also possible the old frontend that produced the PTX used in your JIT compilation had a bug that went undetected due to artifacts of older backends’ code generation.

My recommendation would be to rely on JIT compilation only if absolutely necessary. Instead, create a fat binary that incorporates SASS for each GPU architecture your app needs to support, plus PTX for only the latest architecture for forward compatibility with yet-to-ship architectures.
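As a rough illustration of the fat-binary recommendation, the following Python sketch assembles `nvcc` `-gencode` options: one SASS target per architecture, plus PTX for only the newest one. The helper name `gencode_flags` and the file names `app.cu` and `app` are hypothetical, and the architecture list is only an example; the toolkit in use must actually support every architecture listed.

```python
# Build nvcc -gencode options for a fat binary:
# SASS for every listed architecture, PTX only for the newest one.
# gencode_flags is a hypothetical helper, not part of the CUDA toolkit.

def gencode_flags(sass_archs, ptx_arch=None):
    """sass_archs: compute capabilities as integers, e.g. [30, 35, 52]."""
    archs = sorted(set(sass_archs))
    if not archs:
        raise ValueError("at least one architecture is required")
    if ptx_arch is None:
        ptx_arch = archs[-1]  # newest architecture carries the PTX

    flags = []
    for cc in archs:
        # code=sm_XX embeds machine code (SASS) for that architecture.
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    # code=compute_XX embeds PTX, which the driver can JIT for future GPUs.
    flags += ["-gencode", f"arch=compute_{ptx_arch},code=compute_{ptx_arch}"]
    return flags


if __name__ == "__main__":
    cmd = ["nvcc", *gencode_flags([30, 35, 52]), "-o", "app", "app.cu"]
    print(" ".join(cmd))
```

With a build like this, JIT compilation should only come into play on GPUs newer than the newest architecture listed.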

cbuchner1 · September 30, 2015, 3:34pm

While we’d love to work without making use of JIT translation, we cannot currently target Maxwell GPUs with the CUDA 5.0 toolkit that we are building our application with. Upgrading the toolkit is considered an even bigger risk than upgrading the driver. ;)

We’ve previously used CUDA 2.3, switched to 5.0 not so long ago and the next leap may be to 7.0 (because I like that it supports lambda expressions and is still supported on Ubuntu 12.04).

As for creating a bug report: that faulty kernel is part of a bigger project, and building a simple repro case might take too much effort for now. I’ll reconsider filing a bug report later.

Christian

njuffa · September 30, 2015, 4:13pm

In that case, unless you absolutely need some functionality provided by a newer driver, I’d say stick with the one you know works.

Whether upgrading the toolkit or the driver is a bigger risk I cannot answer based on data, but my gut feeling is that a driver change represents a bigger risk in terms of functionally broken code: it is a complex low-level mechanism with close to zero visibility to the average CUDA programmer. With a toolkit change one can at least look at the generated code, check timing with the profiler, use cuda-memcheck to check for race conditions, etc.

As for using JIT compilation to target GPU architectures that did not exist at the time the application code was created, my long-standing recommendation is to switch to natively compiled code at the first chance, that is, asap.

Long-term use of JIT compilation is best limited to dynamic code generation scenarios, where code must be generated at run time.
