Driver Crash on TitanX during kernel operation.

I have a piece of kernel code in which I address a 2D array with pitch d_cholp=512
If I address it like this, the driver crashes
tmp2 = chol[ cb + i*(d_cholp/sizeof(float)) + j ] ; //cb=cluster, i=row, j=column

If I address it like this, the driver does not crash.
tmp2 = chol[ cb + i*(128) + j ] ; //cb=cluster, i=row, j=column

How is this possible?

My card is a TitanX; the platform is VS2010 on Windows 7 64-bit, PCIe 2.0, driver 9.18.13.5306.
The driver crash is a 2-second black screen, after which the display comes back with a message that the driver stopped responding.

The symptoms of the “crash” indicate that the kernel ran too long (more than a couple of seconds) and was terminated by the Windows watchdog timer, whose task is to keep the GUI from freezing for extended periods of time.

Your kernel may be too complex to finish in the time allotted by the watchdog timer simply because it has too much work to do, or (which seems more likely here) because it contains a bug that causes execution to “go off into the weeds”.

I would suggest reducing the size of the data operated on by the kernel, and then running under control of cuda-memcheck.

Note that the return type of sizeof() is the unsigned 64-bit type size_t. This introduces slower 64-bit arithmetic into the address computation, and it may cause trouble due to mixed signed/unsigned computation possibly producing integer wrap-around in the index, depending on the types of the other variables involved.
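To make the sizeof() point concrete, here is a minimal host-side sketch (made-up values, not your kernel code) showing how a small negative index, which should be easy to spot, can turn into a huge unsigned offset once size_t enters the computation:

// Hedged illustration: how mixing signed ints with the unsigned size_t
// returned by sizeof() can wrap around. Names mirror the snippet above,
// but the values are invented for the example.
#include <cstdio>

int main()
{
    int d_cholp = 512;
    int cb = 0, j = 0;
    int i = -1;  // suppose a bug makes the row index negative

    // d_cholp/sizeof(float) is computed as size_t, so i is converted to an
    // unsigned 64-bit value: -1 becomes 0xFFFFFFFFFFFFFFFF and the index
    // ends up enormously out of bounds instead of a small negative number.
    size_t idx = cb + i * (d_cholp / sizeof(float)) + j;
    printf("wrapped index: %llu\n", (unsigned long long)idx);

    // Casting sizeof() to int keeps the arithmetic in signed 32 bits,
    // which is both faster on the GPU and easier to reason about.
    int safe_idx = cb + i * (d_cholp / (int)sizeof(float)) + j;
    printf("signed index:  %d\n", safe_idx);
    return 0;
}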

Thanks, that makes a lot of sense. I noticed that if I made the problem smaller, it also did not crash anymore.
At the target size the kernel probably just finishes in time with the constant 128, but not when it also has to do the division.
Is there a way to increase the time out period? There must be programs which use the kernel for much longer periods than a few seconds.

There’s a pinned posting at the top of this forum that you are now posting in, which describes this:

https://devtalk.nvidia.com/default/topic/459869/cuda-programming-and-performance/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/

Thanks, increasing the wait time in the registry indeed fixed the problem.
Still, it is a bit annoying that while I run my simulations my whole 4-core CPU is dead; I cannot do any browsing during the GPU operation and have to wait until it is finished.

Since CUDA kernel launches are asynchronous, the full processing power of the CPU is in fact available while the GPU kernel is running, unless you use synchronization calls that cause the CPU to busy-wait.
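As a minimal sketch of that pattern (the kernel, the helper function, and the launch geometry are placeholders, not your code):

#include <cuda_runtime.h>

// Placeholder kernel standing in for the real computation.
__global__ void myKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Stand-in for whatever CPU-side work you want to overlap with the GPU.
void doSomeCpuWork() { /* ... */ }

void runStep(float *d_data, int n)
{
    // The launch returns immediately; the kernel runs in the background.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // The CPU is free here and runs concurrently with the GPU.
    doSomeCpuWork();

    // Only this call makes the CPU wait for the kernel to finish.
    cudaDeviceSynchronize();
}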

However, because the GPU is completely taken up by the CUDA kernel, the GUI freezes: GUI updates also require the GPU, which is unavailable for graphics while the CUDA kernel is running.

A straightforward workaround is to acquire a cheap low-end GPU (for example, I have used a Quadro 600 in the past for this purpose, but there are likely cheaper options than that) and use that to drive the display, while the expensive high-performance GPU remains reserved for CUDA apps.

Getting back to your original question: given that the code seems to be bug-free and is merely affected by the time-out, try pre-computing the stride at the start of the kernel:

const int stride = d_cholp / (int)sizeof(float);

Then later use the stride in the index computation:

tmp2 = chol[ cb + i*stride + j ] ; //cb=cluster, i=row, j=column

That way you are avoiding the magic constant, while retaining good performance.
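For context, here is a rough sketch of what that could look like inside a kernel. The kernel name, parameters, and thread mapping are guesses, since only the indexing line was posted; in your code d_cholp may well be a __constant__ or global variable rather than an argument:

// Hypothetical kernel shape showing the stride computed once from the byte pitch.
__global__ void useChol(const float *chol, int d_cholp, int cb,
                        int rows, int cols, float *out)
{
    const int stride = d_cholp / (int)sizeof(float);   // 512 / 4 = 128 floats per row

    int i = blockIdx.y * blockDim.y + threadIdx.y;      // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;      // column

    if (i < rows && j < cols) {
        float tmp2 = chol[cb + i * stride + j];
        out[i * cols + j] = tmp2;                       // stand-in for the real use of tmp2
    }
}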

Indeed, I never had the problem before, when I still had two graphics cards. Unfortunately I sold them both when I got the TitanX.
I would expect the optimizer to take the division out of the loop, but I had switched off all optimizations to debug this problem. Still, it is probably better to write it in the optimized form, just in case.