Is there a compatibility problem between recent CUDA drivers and the Maxwell architecture, for example on a GTX 970 graphics card?
Our company develops software tools for electromagnetic and infrared physically based simulations. For about 5 years, we have been developing a CUDA version of our applications using the CUDA driver API. Our software works correctly, with satisfactory speedups, on all the cards we tested, from laptop GPUs up to Titan Black cards.
To simplify the compilation process and to ensure compatibility with most CUDA hardware, including hardware we have not seen yet such as the new Maxwell-based graphics cards, we chose to compile our kernels to PTX targeting compute capability 2.0 and let the driver generate the device code.
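For readers unfamiliar with this setup, here is a minimal sketch of the PTX route with the driver API. It is not our production code: the file names `kernels.cu` and `kernels.ptx` are hypothetical, and the primary-context calls come from a newer toolkit than the CUDA 5.0 SDK mentioned in this post. The point it illustrates is that the driver's JIT compiler, not nvcc, produces the final machine code, so a driver update can change the generated code without any change on our side.

```cuda
// Build the PTX offline (hypothetical file names; compute_20 is only
// accepted by toolkits up to CUDA 8):
//   nvcc -ptx -arch=compute_20 kernels.cu -o kernels.ptx
// Build this loader with: nvcc loader.cu -lcuda
#include <cuda.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

// Report a failing driver API call by its numeric CUresult code.
static bool ok(CUresult r, const char* what)
{
    if (r == CUDA_SUCCESS) return true;
    std::fprintf(stderr, "%s failed with CUDA error %d\n", what, static_cast<int>(r));
    return false;
}

int main(int argc, char** argv)
{
    // Read the PTX text into memory (the driver expects a NUL-terminated string).
    const char* ptxPath = (argc > 1) ? argv[1] : "kernels.ptx";
    std::ifstream in(ptxPath, std::ios::binary);
    if (!in) {
        std::fprintf(stderr, "cannot open %s\n", ptxPath);
        return 1;
    }
    std::stringstream buffer;
    buffer << in.rdbuf();
    const std::string ptx = buffer.str();

    CUdevice dev;
    CUcontext ctx;
    if (!ok(cuInit(0), "cuInit") ||
        !ok(cuDeviceGet(&dev, 0), "cuDeviceGet") ||
        !ok(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain"))
        return 1;

    int status = 1;
    if (ok(cuCtxSetCurrent(ctx), "cuCtxSetCurrent")) {
        // Ask the JIT compiler to fill an error log if compilation fails.
        char errorLog[8192] = "";
        CUjit_option options[] = { CU_JIT_ERROR_LOG_BUFFER,
                                   CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES };
        void* values[] = { errorLog, reinterpret_cast<void*>(sizeof(errorLog)) };

        // The installed driver compiles the PTX for the actual GPU here.
        CUmodule module;
        if (ok(cuModuleLoadDataEx(&module, ptx.c_str(), 2, options, values),
               "cuModuleLoadDataEx")) {
            std::printf("PTX was JIT-compiled for device 0\n");
            cuModuleUnload(module);
            status = 0;
        } else {
            std::fprintf(stderr, "JIT log: %s\n", errorLog);
        }
    }
    cuDevicePrimaryCtxRelease(dev);
    return status;
}
```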
This strategy used to work well on the Fermi and Kepler architectures. However, it relies heavily on driver compatibility and on compatibility breaks being documented. The problem is that we ran into undocumented compatibility problems on both Kepler and Maxwell.
On Kepler, some integer texture fetches were performed even though a correct execution of the code would never have reached them. These unwanted fetches used negative coordinates and caused the kernels to return with error 719. We worked around this by wrapping tex1Dfetch in a function that ensures tex1Dfetch is always called with coordinates greater than or equal to zero. We suspect that these fetches were issued ahead of time as an optimization. In any case, we did not find any documentation about this behavior.
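As an illustration of that workaround, here is a minimal sketch of such a wrapper. It makes several assumptions: the names `tex1DfetchClamped` and `gatherKernel` are hypothetical, clamping to zero is only one way to keep the coordinate non-negative, and the sketch uses the texture-object overload of `tex1Dfetch` so that it compiles with current toolkits. Texture objects require compute capability 3.0 or higher, so code that targets compute capability 2.0 would apply the same clamp around a texture reference instead.

```cuda
#include <cuda_runtime.h>

// Hypothetical wrapper: clamp the index so that the texture unit never sees
// a negative coordinate, even if the fetch is issued speculatively.
template <typename T>
__device__ __forceinline__ T tex1DfetchClamped(cudaTextureObject_t tex, int index)
{
    return tex1Dfetch<T>(tex, max(index, 0));
}

// Hypothetical kernel: a negative entry in `indices` means "no data", so a
// correct execution never needs the texture value for it.
__global__ void gatherKernel(cudaTextureObject_t tex, const int* indices,
                             int* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int idx = indices[tid];
        // The clamp inside the wrapper makes the fetch harmless if it is
        // hoisted above this test.
        out[tid] = (idx >= 0) ? tex1DfetchClamped<int>(tex, idx) : 0;
    }
}
```

The result of a clamped fetch is simply discarded when the original index was negative, so the wrapper does not change the output of a correct program.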
On the Maxwell architecture, we are currently facing problems that also seem to be caused by incorrect driver optimizations.
On a GTX 970 graphics card, our software works with driver version 343.98 (provided with the CUDA 6.5 release), and on a GTX 850M card it works with the earlier driver version 332.35. It works with these drivers even when built with the CUDA 5.0 SDK (our current SDK version). Newer driver versions such as 347.52, the latest 347.88, and even 347.62 (provided with CUDA 7.0) do not work with our application: we get wrong results, kernel crashes, or driver crashes. We searched for an explanation in our own code, as we did when tracking down the texture prefetching problem on Kepler. But even after heavy code simplification, some simple operations still behave unexpectedly (e.g. CUDA error 714).
So far, the only solution we have found is to require our customers to use driver version 343.98 when running on a Maxwell graphics card. However, this is not a long-term solution, and we are worried about the future.
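To make such a driver requirement checkable at startup, something like the following sketch could be used. It is an assumption on our part, not something our product does today: it relies on NVML (`nvmlSystemGetDriverVersion`) to read the display driver version string, because `cuDriverGetVersion` only reports the CUDA version supported by the driver. The threshold 343.98 is the last version that works for us on the GTX 970; the exit codes are arbitrary.

```cuda
// Build with: nvcc check_driver.cu -lnvidia-ml
#include <nvml.h>
#include <cstdio>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        std::fprintf(stderr, "NVML initialization failed\n");
        return 1;
    }

    // Read the display driver version as a string, e.g. "343.98".
    char version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE] = "";
    nvmlReturn_t r = nvmlSystemGetDriverVersion(version, sizeof(version));
    nvmlShutdown();
    if (r != NVML_SUCCESS) {
        std::fprintf(stderr, "could not query the driver version\n");
        return 1;
    }

    // Parse the leading "major.minor" part of the version string.
    int major = 0, minor = 0;
    if (std::sscanf(version, "%d.%d", &major, &minor) != 2) {
        std::fprintf(stderr, "unexpected driver version format: %s\n", version);
        return 1;
    }

    // 343.98 is the last driver known to work for us on Maxwell.
    const bool newer = (major > 343) || (major == 343 && minor > 98);
    std::printf("driver %s: %s\n", version,
                newer ? "newer than the last known-good version 343.98"
                      : "not newer than 343.98");
    return newer ? 2 : 0;
}
```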
The questions we ask are:
• Is anyone else facing the same problems?
• Are there changes in the CUDA driver optimizations for Maxwell that could explain this behavior?
• Are some drivers certified for CUDA while others are not?
• What can we do to help fix such problems?
• In the future, will we have to worry about such compatibility problems with each new hardware generation?