Porting from Maxwell (TITAN X) to Pascal (GTX 1080) unspecified kernel launch error


I am trying to port CUDA code from a TITAN X to a GTX 1080. The code worked very well on the TITAN X, even on multiple GPUs. On the GTX 1080, I get very strange behavior.

Sometimes, kernels take a long time to launch (200 ms or more); sometimes they do not launch at all. I have disabled the Windows driver timeout detection (TDR). cuda-memcheck does not run the code; it simply hangs, and once it produced a BSOD on Windows 7. I have placed error-code checks on every single CUDA API call and kernel invocation, and there is no indication of error until the driver starts to produce "unspecified launch failure" errors. It seems like some timeout is being tripped, as it takes about 2 seconds for the unspecified launch failure to be triggered. I have tried putting a cudaDeviceSynchronize() after every single call just to ensure there are no overlapping accesses. I have adjusted the numbers of blocks/threads to no avail. nvprof refuses to run as well, complaining among other things that "libcuinj" is not present. I have reinstalled the latest display driver and CUDA 8.0.27 twice.
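For reference, the per-call checking described above is commonly wrapped in a macro like the sketch below. `CUDA_CHECK` and `dummyKernel` are illustrative names, not from the original code; the point is checking both `cudaGetLastError()` (launch-configuration errors) and `cudaDeviceSynchronize()` (asynchronous execution errors) after each launch:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro; wraps every CUDA runtime API call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    CUDA_CHECK(cudaGetLastError());      // catches invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize()); // surfaces errors from kernel execution
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```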

I suppose this serves me right for being on the bleeding edge, but it's very frustrating. Does anyone have experience with a similar problem?


Which version of CUDA are you using to build the code? CUDA 7.5 or CUDA 8.0 RC? Is CUDA 8.0.27 the version of the CUDA driver you have installed? What is the exact nvcc command line used to build the code? Can you post a minimal, complete, buildable, and runnable example code that reproduces the issue you are seeing?

Independent of any potential problems you may be encountering with new NVIDIA software components, I think one should consider the possibility of latent bugs in the existing code, including the possibility of unchecked API status returns. When you run the existing app on the Titan X under control of cuda-memcheck, with out-of-bounds, race condition, and API checkers enabled, are any errors reported? If none are reported with the Titan X, are any reported when you run on the GTX 1080?
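For anyone unfamiliar with the individual checkers, the cuda-memcheck suite (as of CUDA 8.0) exposes them as separate tools; a sketch of the invocations, with `./myapp` standing in for the actual application:

```shell
cuda-memcheck --tool memcheck  ./myapp   # out-of-bounds / misaligned accesses
cuda-memcheck --tool racecheck ./myapp   # shared-memory race conditions
cuda-memcheck --tool synccheck ./myapp   # invalid __syncthreads() usage
cuda-memcheck --tool initcheck ./myapp   # reads of uninitialized device memory

# Also report CUDA API calls that return an error code:
cuda-memcheck --report-api-errors all ./myapp
```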

Which NVIDIA driver is installed? Is it 368.69?

I ask because while this latest driver fixed one bug related to global memory reads from a large buffer, some of the bandwidth tests (in the CUDA SDK and other bandwidth tests) have been throwing the same error.

It could be a hidden bug in your code, but I have seen this same error appear with this latest driver (and only with this driver).

There is a new driver, 369.81, that was released today, so try updating to it and see if it solves your issue.

Hello, thanks for your replies! I installed the following files:


so I think I have the latest CUDA installed (but apparently not the latest display driver anymore). I uninstalled CUDA 7.5 to make sure there could not be confusion between the two versions when compiling or executing.

So on the suggestion of njuffa I took out the GTX 1080s and put back in the GTX TITAN Xs. I ran cuda-memcheck on my code without "-G -g" compiled in, and it reported no errors. I tried all four tools (memcheck, racecheck, synccheck, and initcheck). One complicating factor, however, is that I am running my CUDA code as a MEX (DLL) file under MATLAB. Therefore I am running the following cuda-memcheck command:

cuda-memcheck --report-api-errors all --tool memcheck --save memcheck.txt "C:\Program Files\MATLAB\R2015b\bin\matlab.exe" -nojvm -nosplash -nodisplay -nodesktop -wait -sd "c:\Users\Stationary\Desktop\CUDACode" -r "headless;exit;"

which runs MATLAB as a single process without its Java front end (headless is a MATLAB script that runs my CUDA MEX file). I have used the same technique in the past to run my program through Visual Profiler successfully (on the TITAN X only). When I try to run my process with "-G -g" debugging built in, cuda-memcheck takes so long as to be unusable (I allowed it to run overnight with no results, and the computer effectively crashed, since the WDDM timeout is disabled and the display was not functioning).

Unfortunately, I have not been able to successfully bisect my code and form a smaller subset that can reproduce the error, and the data sets I work on are very large, otherwise I would provide a code sample. My application is processing radar data sets, and there is some sensitivity to releasing the whole code library to the world (at least at the present time).

I will try the new driver and see how it does. Thanks for your help, and let me know if you have any further ideas.



The new version of the driver fixed the problem. Thanks!

One thing I have noticed is that while the run time is faster on average, it seems to be more variable than on the TITAN X. Perhaps there are still some details to be worked out, but I have four GTX 1080s working in parallel. I am trying to figure out whether there are some strange kernel launch latency issues.
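If it helps to narrow that down, a rough bound on launch latency can be taken with CUDA events around an empty kernel; a minimal sketch (the kernel and launch configuration here are arbitrary). Note that under WDDM the driver batches launches, so individual timings can vary a lot:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();   // warm-up launch absorbs one-time init cost
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("approx. launch + execution time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```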



Glad to hear that resolved the issue.

I work quite a bit with CUDA/C++ mex files called from MATLAB, and have learned that often MATLAB will get ‘confused’ when you use multiple GPUs. I believe that MATLAB tries to keep track of GPU usage specifics due to its own GPU capabilities which can be invoked through its ‘Parallel Computing Toolbox’.

My approach has been to keep all the CUDA invocations and usage in the mex file, make sure to set the last GPU used to your intended default, and call cudaDeviceReset() at the end of the application.
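As a sketch of that pattern (the function name and device indices are illustrative, not MATLAB-specific API; in a real mex file this would live inside mexFunction):

```cuda
#include <cuda_runtime.h>

// Illustrative cleanup pattern: do all multi-GPU work inside the mex
// file, then restore a known default device before returning to MATLAB.
void runOnAllGpus() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        // ... launch kernels / copies for this device ...
        cudaDeviceSynchronize();
    }
    cudaSetDevice(0);   // leave the intended default device selected
    cudaDeviceReset();  // release this process's resources on that device
}
```

One caveat: cudaDeviceReset() only resets the device that is current at the time of the call, so setting the intended default device first matters.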

Also, just to keep things simple, I prefer to profile a 'stand-alone' application via nvprof or nvvp rather than work around profiling the mex file through MATLAB. I know it is possible to profile a mex file, but in the past I have had weird issues and figured it is more time-efficient to create an isolated console test case.