Simple CUDA program hitting size limits/errors on Windows but not Linux

I’m teaching a course on parallel computation using GPUs. This year we’re moving the course from Linux to Windows. Most programs converted fine, but one of the programs that worked fine on Linux gives errors on Windows when the data set gets reasonably large.

I’ve pruned the program down to a minimal example that illustrates the behavior. It runs fine for 512x512 and 1024x1024 matrices, but 2048x2048 matrices result in “unspecified launch failure” from cudaGetLastError() after launching the kernel.

We’ve already learned about:

  • running the GPU as a computing device only (not used for graphics), so Windows does not time us out
  • setting Visual Studio to the x64 build

A few more notes:

  • Visual Studio 2017, build 15.8.9
  • NVidia NSight Visual Studio Edition
  • GPU = Quadro K620
  • This pruned-down example runs fine with a Release build; the error occurs only in the Debug/x64 build

Many of the lines of code will look quite irrelevan – but remember that this example is quite pruned down, and removing those lines of code caused the issue to vanish.

Here’s the code:

using namespace std;
#include <cuda_runtime.h>

#define LOG(args) cout << args << endl
#define DIE(args) { cout << args << endl; exit(0); }
#define ERR_CHK(status, args) if (status != cudaSuccess) DIE (args << " (error code “<<cudaGetErrorString(status)<<”)")

const int BS=32;
global void mat_mult (float *dev_mem) {
shared float SA[BS][BS];
shared float SB[BS][BS];
float sum=0.0;
for (int kB=0; kB<64; ++kB) {
SA[0][0] = 0.0;
SB[0][0] = 0.0;

for (int kI=0; kI<BS; ++kI)
    sum += (SA[0][0]) * (SB[0][0]);

dev_mem [0] = 0;


static void run (int N) {
LOG (endl<<“Working on “<<N<<“x”<<N<<” matrices.”);

int sizeBytes = 16;
float *dev_mem = NULL;
cudaError_t err = cudaMalloc((void **)&dev_mem, sizeBytes);
ERR_CHK (err, "Failed to allocate device mem");

int NBLK = N / BS;		// # of blocks in any one dimension
dim3 grid (NBLK, NBLK), block(BS, BS);
mat_mult <<<grid, block>>> (dev_mem);
err = cudaGetLastError();
ERR_CHK (err, "Failed to launch/complete the mat_mult() kernel");

err = cudaDeviceSynchronize();
ERR_CHK (err, "Failed to synchronize");

err = cudaFree(dev_mem);
ERR_CHK (err, "Failed to free device memory");

LOG ("Success");


// Main() lives on the CPU.
int main() {
run (512);
run (1024);
run (2048);
run (4096);
return (0);

Apologies for the ugly code formatting – it was correctly indented when I pasted it into this window, but something seems to have removed all of the leading spaces!

Your sample worked fine for me

Working on 512x512 matrices.

Working on 1024x1024 matrices.

Working on 2048x2048 matrices.

Working on 4096x4096 matrices.

Can you let me know your driver version, CUDA toolkit version, and your project properties > CUDA c/c++ > Device > Code Generation settings are.
I’ll ask our QA group to see if they can reproduce your issue.

Interesting… thanks for checking. I’m off for a few days now, but I’ll update the post with all of the info as soon as I return.

Here is the information you requested (and a bit more as well).

Driver version = Quadro K620 411.31 (from NVidia control panel)

CUDA toolkit = release 10.0, V10.0.130, Built on at_Aug_25_21:08:04_Central_Daylight_Time_2018 (from nvcc --version)

Code generation settings = compute_50, sm_50

Windows SDK 10.0.17763.0
Configuration Debug x64

Final compilation line = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\nvcc.exe" -gencode=arch=compute_50,code=“sm_50,compute_50” --use-local-env -ccbin “C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Tools\MSVC\14.15.26726\bin\HostX86\x64” -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc141.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\ “Q:\193STP\2017f\windows\bug\”

Looks reasonable to me. Moving to a CUDA Programming thread board for better help

This looks like a WDDM TDR timeout to me.

What is the output from running nvidia-smi.exe in a command prompt window?

If that exe is not found/runnable, please use the search bar in the windows file manager to find the location of nvidia-smi.exe on your machine (search your local hard drive), and run it from there.

Please also provide the output from running the deviceQuery sample code on that machine.

(leaving my signature here just in case)

Thanks for the suggestions. I would be surprised if it turns out to be a WDDM TDR timeout issue, since:

  • we originally had the code fail with ““Failed to copy result from device to host (error code the launch timed out and was terminated)”
  • we then moved the display connector from being driven by the Nvidia GPU to being driven by the CPU’s internal graphics hardware
  • ran dxdiag to confirm that Windows was no longer using the GPU as a display driver
  • the “launch timed out and was terminated” messages no longer appeared after that.

Here are the responses to your requests for more data. First, running nvidia-smi.exe gives
c:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Fri Jan 04 17:22:06 2019
| NVIDIA-SMI 411.31 Driver Version: 411.31 |
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Quadro K620 WDDM | 00000000:01:00.0 Off | N/A |
| 35% 49C P8 1W / 30W | 40MiB / 2048MiB | 0% Default |

| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| No running processes found |

Next, running deviceQuery (I found one already compiled) gives
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\demo_suite>deviceQuery.exe
deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Quadro K620”
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1124 MHz (1.12 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = Quadro K620
Result = PASS


In my experience, the error status “launch timed out” has no false positives. The only way to get this status is for a kernel to exceed the GUI watchdog time limit.

The Quadro K620 is a very basic entry-level device (i.e. very slow), so triggering the GUI watchdog timer seems likely. The purpose of the watchdog timer is to prevent the GUI from being blocked by a long-running GPU kernel. So it makes sense that moving GUI service to internal CPU graphics solved the issue with compute kernels occupying the Quadro for “long” periods of time. Using a debug build will slow down execution speed significantly, making it much more likely to hit the timeout limit.

Now that you got things working with the Quadro, what kernel execution times do you observe for this application? While typical limits are around the two second mark, differences in the specific value of the GUI timeout limit between various operating systems are likely. BTW, is the hardware platform of the Linux and Windows systems completely identical, i.e. the only thing that was changed is the installed OS?

I compiled the posted program with CUDA 8.0 on a Windows 7 machine and it ran without any issues. Also, cuda-memcheck did not report any errors.

What is the exact nvcc command line that was used to build the application? Are you building for the correct GPU architecture (the Quadro K620 seems to be the lowest-end Maxwell-based Quadro, so build for sm_50)?

Njuffa: thanks for your suggestion. However, I wonder if you’ve misread the thread slightly.

“Launch timed out” issue was the error that we were previously getting. Then we fixed that error by using the Intel CPU as the display-driver hardware, with the Nvidia GPU used only as a compute engine. Thus, I believe that the WDDM timeout should no longer be relevant.

The error that we are currently getting is not “launch timed out;” rather, it is “unspecified launch failure”.

To answer your questions:

  • the size that is failing is taking just under 3 seconds (which could indeed be consistent with a timeout, other than the notes just above)
  • Windows and Unix are running on exactly the same hardware systems. They were all purchased at once, and then we built Linux systems with some and Windows with others, to outfit three computing labs.


I am not sure that is actually true, maybe Robert Crovella knows for sure one way or the other. WDDM is incredibly intrusive and insists on keeping maximum control of graphics subsystems, even more so with WDDM 2.x that is found in Windows 10.

The normal way to get around this is to operate in TCC mode, as this makes the GPU into a “3D controller” rather than a graphics device, as far as Windows is concerned. I am not sure that low-end Quadros are supported by TCC, but you could try.

I am reasonably sure that I have seen time-out issues reported as “unspecified launch failure” before, so the reverse of what I stated above is not necessarily true. In other words, if the status reports timeout, a timeout happened for sure, but when there is a timeout, it may not always report as timeout. Whether this should be considered a bug, I would no be able to say. Because the watchdog timer graphics reset enforced by the OS is a big mallet, it may not always be possible to properly detect that this is what happened, beyond the fact that the kernel was terminated abnormally.

I believe it is WDDM TDR. None of the things you did have any effect on that. Your GPU is in WDDM mode, the deviceQuery output explicitly states your kernels have a runtime limit, and you have not given any indication that you have disabled the WDDM TDR mechanism. Furthermore larger grid sizes provoke the effect, and you are hitting the error at 2-3 seconds kernel duration. Also, switching from debug to release mode and the error goes away indicating that shorter kernel run times are the key issue, not any logical flaw in your code, or violation of any underlying hardware limits (other than the WDDM TDR duration limit).

I would start there.

Again, your statements about moving display connectors and running dxdiag have no bearing on the WDDM TDR mechanism.

“unspecified launch failure” can be the observed error in a WDDM TDR event:

A simple experiment that increases the TDR limit or disables it altogether should settle whether this is the root cause of the observation (which I believe it is). Microsoft documents the relevant registry keys here:

OK, I’ll give it a try. Previously our sys admins had not wanted me to touch the registry, but you’ve now given me a reason to push on them a bit harder. I’ll let you know the results (most likely middle of next week).


If the registry change doesn’t work, it probably means the registry change wasn’t made properly. Modifying the windows registry directly is not a trivial matter in my opinion (we can agree to disagree). There are plenty of examples on the internet where folks have modified it incorrectly.

Over the course of the last 20 years of my career, I’ve modified the windows registry maybe dozens of times. Not once did I feel like I knew what I was doing, like I understood what I was doing, or that I had a high level of confidence that what I was doing was correct. The plethora of versions of even just windows 10, along with the propensity of microsoft to change registry organization details with each and every version, makes me pretty nervous about it. I’ve looked at that microsoft registry page and it inspires no confidence in me whatsoever. Best of luck to you if you take that approach.

I’m much more comfortable advising others to use the method provided by Nsight Visual Studio Edition, which is described in the link I already provided. Yes, ultimately it is probably the same under the hood, but you’re using a tool that has very simple controls, is plainly evident as to its purpose, is maintained by a team of engineers, and goes through QA testing on a regular basis.

You’re welcome to do whatever you wish, of course, but I’ll repeat myself: if direct modification of the registry doesn’t affect the behavior, the most likely explanation in my opinion, in this scenario where every conceivable data point is indicating a TDR event, is that the registry change was made incorrectly.

Note that I consider this sort of test (modification of TDR settings) to be a diagnostic; simply to answer the question “is this due to TDR?” I don’t consider it a robust production methodology. When the TDR is disabled for example, even on a “non-display” GPU, I have witnessed windows becoming unstable if a kernel takes too long (say, longer than about 20 seconds). I’m not saying its guaranteed to be unstable, and if you feel that your experience is different, so be it - do what you wish; we can agree to disagree. This is all just my opinion anyway. In a teaching environment, where students might do anything, I wouldn’t recommend changing the TDR setting on a WDDM GPU (except maybe to go from 2 seconds to 5 seconds or something like that, if that’s terribly useful). Instead, design work that doesn’t take 2 seconds to run, or invest in GPUs that can be placed in TCC mode, or switch to linux. If you design assignments that should run in less than 0.1 second, then if a student hits the TDR, they’ve basically done something wrong, and it’s a much more predictable outcome than having a disabled TDR.

Robert and Njuffa: thanks! You were indeed correct. Increasing the timeout interval made the problem go away.

I think we’ll wind up upping the WDDM TDR timeout just slightly (e.g., follow Robert’s suggestion and go to 5 seconds); enough to let them compare reasonably-sized problems on a CPU vs. GPU, but hopefully not enough to cause any Windows issues.


Just out of curiosity, how long does it take to run your mat_mult() kernel on a K620 after you fixed TDR ?
According to nvprof, on a 1080Ti it takes:

GPU activities:  100.00%  960.12us         4  240.03us  13.120us  720.40us  mat_mult(float*)

cudaMalloc was what took most time here:

API calls:   98.48%  241.06ms         4  60.265ms  354.52us  238.07ms  cudaMalloc

2048x2048 was 4.22 seconds (almost all of which was in the kernel call, rather than in cudaMalloc).

The reason I asked is because your run times look a bit strange. I understand the K620 is entry-level and focused on energy efficiency, so I also compiled your code on my Macbook Pro 15" from 2012 which has a GT650M (Kepler) with 1GB and 128bit memory bus. This way the comparison is more fair. Here is the result of a run (didn’t use -O3, just “nvcc -o program”):

GPU activities:  100.00%  10.620ms         4  2.6549ms  129.25us  7.9911ms  mat_mult(float*)
API calls:        86.74%  83.091ms         4  20.773ms  150.01us  82.580ms  cudaMalloc

I don’t really know the performance difference between these cards, but I don’t think it is in the order of just a few milliseconds to 4 seconds.