[Solved] Compatibility problem of ptx compute2.0 with GTX 970 (Maxwell)

Is there a compatibility problem between recent CUDA drivers and the Maxwell architecture, for example on a GTX 970 graphics card?

Our company develops software tools for electromagnetic and infrared physically-based simulations. For about 5 years, we have been developing a CUDA version of our applications, using the CUDA driver API. Our software works correctly, with satisfactory speedups, on all the cards we have tested, from laptop GPUs up to Titan Black cards.

In order to simplify the compilation process and to ensure compatibility with most CUDA hardware, including hardware unknown to us such as the new Maxwell-based graphics cards, we chose to compile our kernels to PTX for compute capability 2.0.

This strategy used to work well on Fermi and Kepler architecture. However, this strategy relies heavily on driver compatibility and documentation of compatibility breaks. The problem is that we ran into undocumented compatibility problems both for Kepler and for Maxwell.

On Kepler, some integer texture fetches were executed even though a correct execution of the code would never have performed them. These unwanted fetches used negative coordinates and caused the kernels to return with error 719. We worked around this by wrapping tex1Dfetch in a function that ensures it is always called with non-negative coordinates. We suspect the fetches were issued in advance as an optimization. In any case, we did not find any documentation about this.
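For readers who hit the same issue, the workaround described above might look roughly like the following. This is only a sketch, not the original poster's code: the names `clamp_to_nonnegative`, `safe_tex1Dfetch` and `check_clamp_rule` are made up for illustration, and the device wrapper assumes the texture-object overload of `tex1Dfetch` (compute capability 3.0 and later), whereas code built for compute capability 2.0 would use texture references instead. The clamping helper also compiles as plain C++, so the rule can be checked on the host without a GPU.

```cpp
#include <cassert>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build,
// so the index logic can be compiled and tested on the host.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: map any negative index to 0, leave others unchanged.
HOST_DEVICE inline int clamp_to_nonnegative(int index) {
    return index < 0 ? 0 : index;
}

#ifdef __CUDACC__
// Hypothetical wrapper (assumption: texture-object API, sm_30 or later).
// Even if the fetch is issued on a path whose result is later discarded,
// the coordinate handed to tex1Dfetch is never negative.
template <typename T>
__device__ inline T safe_tex1Dfetch(cudaTextureObject_t tex, int index) {
    return tex1Dfetch<T>(tex, clamp_to_nonnegative(index));
}
#endif

// Host-side sanity check of the clamping rule.
inline void check_clamp_rule() {
    assert(clamp_to_nonnegative(-5) == 0);
    assert(clamp_to_nonnegative(7) == 7);
}
```

Note that clamping changes which element is read for a negative index, so it only makes sense when, as in the case described, the result of such a fetch is never actually used.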

On Maxwell architecture, we are currently facing problems which also seem to be caused by incorrect driver optimizations.

On a GTX 970, our software works with driver version 343.98 (provided with the CUDA 6.5 release), and on a GTX 850M it works with the earlier driver version 332.35. It works with these drivers even when built with the CUDA 5.0 SDK (our current SDK version). Newer driver versions such as 347.52, the latest 347.88, and even 347.62 (provided with CUDA 7.0) do not work with our application: we get wrong results, kernel crashes, or driver crashes. We searched for an explanation in our own code, as we did when tracking down the texture prefetching problem on Kepler. But even after heavy simplification of the code, some simple operations behave unexpectedly (e.g. CUDA error 714).

So far, the only workaround we have found is to require our customers to use driver version 343.98 with a Maxwell graphics card. However, this is not a long-term solution and we are worried about the future.

The questions we ask are:
• Is there anyone facing the same problems as us?
• Are there changes in the optimizations performed by the CUDA drivers for Maxwell that could explain this behavior?
• Are some drivers certified for CUDA and others not?
• What can we do to help in fixing such problems?
• In the future, will we have to worry about such compatibility problems with each new hardware generation?

I think you really want to get a registered developer account with nVidia and file bugs in their ticketing system. You will always want to include some kind of minimal repro case and state the exact circumstances under which the problem can be reproduced. Distilling your code into minimal repro cases may be tricky though.

As a short term workaround consider making a version of your software that specifically targets Compute 3.0 and higher. You can instruct the compiler to generate PTX and cubin code for any compute capability version you specify. At runtime CUDA will always pick the closest matching cubin or PTX version it finds in your binaries.



As cbuchner1 says, the recommended course of action is to file bug reports with NVIDIA, using the form linked from the CUDA registered developer website. If you are not a registered developer yet, the sign-up process is straightforward, and approval generally occurs within one business day. Please attach a self-contained repro code that demonstrates the issue.

While a bug in the tool chain is entirely possible, it is also possible that your code contains a latent bug that is exposed by newer compilers that implement more sophisticated optimizations.

I would also recommend changing the approach of using code generated for compute capability 2.0 across all architectures, presumably via JIT compilation. Apart from the potential overhead of JIT compilation, code compiled for sm_20 is subject to the same sm_20 restrictions on all GPUs, even if the newer GPU architecture offers more resources and/or additional computational facilities.

Instead, I would recommend using a “fat binary” approach, where the executable incorporates binaries for each GPU architecture supported by the product, plus one PTX version built for the latest architecture, for JIT compilation on future architectures. That is how NVIDIA itself delivers its libraries. Building fat binaries is straightforward.
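To make the fat-binary idea concrete, here is a rough sketch of the matching rule it relies on, written as plain host C++. It is a simplified illustration based on the documented compatibility rules (a cubin runs on devices with the same major compute capability and an equal or higher minor one; PTX can be JIT-compiled for devices of equal or higher compute capability), not NVIDIA's actual loader logic. All type and function names here are hypothetical. An application that ships separate cubin and PTX files and loads them itself through the driver API would presumably need to apply a rule like this on its own.

```cpp
#include <cassert>
#include <vector>

// Compute capability, e.g. {5, 2} for a GTX 970.
struct ComputeCapability {
    int cc_major;
    int cc_minor;
};

enum class ImageKind { None, Cubin, Ptx };

struct Choice {
    ImageKind kind;
    ComputeCapability target;  // capability the chosen image was built for
};

// A cubin is usable only within the same major version, minor <= device minor.
inline bool cubin_runs_on(ComputeCapability cubin, ComputeCapability device) {
    return cubin.cc_major == device.cc_major && cubin.cc_minor <= device.cc_minor;
}

// True if a is an equal or older capability than b.
inline bool not_newer_than(ComputeCapability a, ComputeCapability b) {
    return a.cc_major < b.cc_major ||
           (a.cc_major == b.cc_major && a.cc_minor <= b.cc_minor);
}

// Prefer the closest compatible cubin; otherwise fall back to the newest
// PTX image that the device can JIT-compile; otherwise report None.
inline Choice choose_image(const std::vector<ComputeCapability>& cubins,
                           const std::vector<ComputeCapability>& ptx_images,
                           ComputeCapability device) {
    Choice best = Choice{ImageKind::None, ComputeCapability{0, 0}};
    for (const ComputeCapability& c : cubins) {
        if (cubin_runs_on(c, device) &&
            (best.kind == ImageKind::None || c.cc_minor > best.target.cc_minor)) {
            best = Choice{ImageKind::Cubin, c};
        }
    }
    if (best.kind == ImageKind::Cubin) return best;
    for (const ComputeCapability& p : ptx_images) {
        if (not_newer_than(p, device) &&
            (best.kind == ImageKind::None || not_newer_than(best.target, p))) {
            best = Choice{ImageKind::Ptx, p};
        }
    }
    return best;
}

// Usage example: a product shipping cubins up to 5.2 plus PTX for 5.2.
inline void example_selection() {
    std::vector<ComputeCapability> cubins = {{2, 0}, {3, 0}, {3, 5}, {5, 2}};
    std::vector<ComputeCapability> ptx = {{5, 2}};
    // GTX 970 (5.2): an exact cubin match exists, so no JIT is needed.
    assert(choose_image(cubins, ptx, ComputeCapability{5, 2}).kind == ImageKind::Cubin);
    // A hypothetical future 6.x device: only the PTX can be used, via JIT.
    assert(choose_image(cubins, ptx, ComputeCapability{6, 0}).kind == ImageKind::Ptx);
}
```

The second case in the usage example is the one this thread is about: on an architecture newer than any shipped cubin, everything depends on the driver's JIT compiling the PTX correctly.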

Yes, I noticed this as well, and I also made a forum post about it. PTX is simply not fully forward/backward compatible, so the advice, for now at least, is to compile for each sm/compute architecture you want to support.

This is what my bandwidth currently does… and so far seems to work ok.

This behaviour is also somewhat documented, but the documentation is somewhat self-contradictory.

Goal of PTX is to be compatible… goal of arch is to be different… so there is a conflict and also goal not yet fully reached it seems.

Firstly, thank you for your rapid and kind answers.

I tried compiling to a cubin for compute capability 5.2 as you suggested, and it worked. However, compiling to PTX for compute capability 5.2, as I had tried before, does not work.

So delivering cubins would give us an alternative solution.

However, I will quote the Maxwell compatibility guide :

In this case, our approach of delivering only PTX should work. Even if we provide cubins for all existing architectures, we cannot ensure that our code will be compatible with future architectures. In practice, if we deliver to a customer a version of our code that works on their current graphics card, we cannot afford to make a new delivery when they replace that card with a faster one.

Anyway, I think we will follow your advice and report a bug in JIT compilation, as that seems to be the only suitable solution for compatibility.

We have other urgent matters right now, but as soon as we have time, we will try to produce an example and file a bug report.

We provided an example to NVIDIA. It was a bug, and it has been fixed in the CUDA 7.5 release. The same code as before now works as expected, both as PTX and as cubin.

Thanks everyone.