intermittent killer kernel Kernel which causes CUDA to die, followed by launch failures

I have found a simple kernel which seems to cause the CUDA environment to die. It is intermittent, requiring somewhere between

0 and 100k iterations. After “death”, subsequent CUDA calls get an “unspecified launch failure” until the calling program exits. I am using the kernel to do a high speed device memory to device memory copy, but other permutations of the kernel which do more useful work also exhibit the same behavior. Some thread block sizes work just fine, others have the problem. For example, xSize of 512 with 128 threads per block works just fine. xSize of 448 with 112 threads per block causes the CUDA environment to die after somewhere between 0 to 100k iterations. The kernel uses shared memory to stored reads from input device memory and then writes out to device memory

from shared memory. When shared memory is not used, the kernel runs much more slowly and the problem doesn’t happen.

Trying out xSizes of 32 to 512 in multiples of 16, failure was found for xSizes of 144,160,192, 272, 304, 400 and 448.

There may be others between 448 and 144. xSize of 128, 256, 384, 464-512 are definitely ok.

The thread block size is xSize/4.

CUDA 2.1 Beta toolkit and SDK

Visual Studio 2005

I have tried it on 2 platforms and with 2 different GPU cards

NVIDIA 181.20 driver

Intel Xeon 5120 1.86 GHz 2 GBytes RAM Windows XP

Intel Xeon E5410 2.33 GHz 8 Gbytes RAM Windows XP64

8800 GT and FX3700

Here is the kernel and wrapper function:

[codebox]global void

dummy_kernel(unsigned char *id, unsigned char *od, unsigned int yPitch, unsigned int zPitch)

{

unsigned int x = threadIdx.x * 4; 

unsigned int y = blockIdx.x;

unsigned int z = blockIdx.y;

__shared__ unsigned char shMem[512];   // 1 line of samples



unsigned int i = __mul24(z,zPitch) + __mul24(y,yPitch) + x;

*(int *)&shMem = (int)&id[i];

__syncthreads();

*(int *)&od[i] = *(int *)&shMem[x];

}

extern “C” void

testKernel(unsigned char *d_in, unsigned char *d_out, int xSize, int ySize, int zSize, int yPitch)

{

  dim3 gridSz(ySize,zSize);  

      dim3 blockSz(xSize/4);

      int zPitch = ySize * yPitch;

dummy_kernel<<<gridSz, blockSz>>>(d_in, d_out, yPitch, zPitch);

}

[/codebox]

Here is the calling and setup code:

[codebox]#include <stdio.h>

#include <cuda_runtime.h>

#include <cutil.h>

extern “C” void testKernel(unsigned char *d_in, unsigned char *d_out, int xSize, int ySize, int zSize, int yPitch);

//

// make a number of 3D device arrays for use as temporary buffers

//

void

createDeviceArrays(int xSize, int ySize, int zSize, int count, size_t *pitch, unsigned char **bufA)

{

cudaThreadSynchronize();

cudaPitchedPtr pitchedPtrA;

cudaExtent extentDesc = make_cudaExtent(xSize,ySize,zSize);

for (int j=0; j<count; j++)

{

	cudaError_t e = (cudaError_t) cudaMalloc3D( &pitchedPtrA, extentDesc );

	if (e)

	{

	    printf("ERROR:  createDeviceArrays %d  %s \n",e,cudaGetErrorString(e));

	}

    bufA[j] = (unsigned char*) pitchedPtrA.ptr;

	printf("createDeviceArray[%d]=%IX  size=%d %d %d \n",j,bufA[j],xSize,ySize,zSize);

}

*pitch = pitchedPtrA.pitch;

cudaThreadSynchronize();

}

#define N_3D_DEVICE_BUFFERS 2

unsigned char *d_buf3D[N_3D_DEVICE_BUFFERS];

size_t yPitch;

void

allocate(int xSize, int ySize, int zSize)

{

createDeviceArrays(xSize, ySize, zSize, N_3D_DEVICE_BUFFERS, &yPitch, d_buf3D);

}

void

process(unsigned char *src, unsigned char *dest, int xSize, int ySize, int zSize)

{

cudaError_t e;

// skip copying host buffers up to the device

//cudaThreadSynchronize();

testKernel(d_buf3D[0], d_buf3D[1], xSize, ySize, zSize, (int)yPitch);

//cudaThreadSynchronize();

// skip copying device buffers to host

if (e=cudaGetLastError())

{

	printf("testKernel cuda Error: %d %s\n",e,cudaGetErrorString(e));

	printf(" %d %d %d\n",xSize, ySize, zSize);

	while (1) {}  // just wait here

}

}

unsigned char buffer1[512256256], buffer2[512256256];

void

main()

{

// some “good” xSizes 512, 480, 256

// “bad” xSizes 448, 400

int xSize=448;

int ySize=137;

int zSize=72;

allocate(xSize,ySize,zSize);

printf(“pitch=%d\n”,(int)yPitch);

for (int k=0; k<100; k++)

{

for (int j=0; j<1000; j++)

{

  process(buffer1, buffer2, xSize, ySize, zSize);

}

printf("%d k iterations\n",k+1);

}

printf(“success\n”);

while (1)

{}

}[/codebox]

I would be happy with any information about this problem. I have a work around for now–just don’t allow those “bad” sizes

to be used. I just increase pad the data to the next highest “good” size and accept the performance penalty.

I would like to understand this problem and make sure my work-around is truly robust or have a fix.

By the way, this kernel which does device memory to device memory copies appears to be a lot faster than memcpy3D–perhaps

by about a factor of 4. Anybody have any idea why? Could it be related to my problem?

I have also attached the complete Visual Studio 2005 project. It should be placed in the SDK projects directory.

Thanks for any help.

cellophane man
cudaVIEbug.zip (3.47 MB)

compile with -deviceemu, run through Valgrind on Linux. it is always the best answer.

Holy crap. cellophane man, you have described EXACTLY the symptoms I’ve been experiencing on and off for years now. Maybe with your simpler code example, NVIDIA will finally take notice and see about solving the problem… or maybe that is just wishful thinking.

First, as Tim pointed out, we need to double check and ensure that there are no out of bounds memory accesses in your code. I’ll run this on my box tomorrow morning sometime. Assuming it checks out, I’ll post more details.

Good questions.
I did try the device emulation and the code passed (slowly) thru many iterations with no problems.
I also checked for out of bounds memory access by inserting if statements to ensure that
indices are in bounds with again no problem found.
cellophane man

OK, it is official: you have the same issue I do with a vastly different kernel. A run of valgrind through several hundred iterations (took forever!) confirms that the code, exactly as you pasted it to your original post, has NO out of bounds memory accesses.

Here is my independent description of the problem (I’ve been dealing with it for years, so I have lots of info).

    Certain kernels, even when called on exactly the same data will randomly ULF (unspecified launch failure)

    Oh, and these are kernels that are KNOWN 100% to have no out of bounds memory accesses (which is by far the most common cause of ULF), through meticulous double checking and verification by valgrind

    Sometimes, the frequency of ULFs can be changed by changing the block size, but I have never seen a changed block size totally remove the random ULF (you may need to run more iterations with your “good” sizes, cellophane_man)

    Sometimes, the kernel “hangs” and triggers the 5s timeout instead of causing a ULF. This is very rare.

    Trivial changes to a kernel (i.e. changing a for to a while loop or changing an if to a boolean operation) will cause the kernel to then work without ULFs

    Different kernels have different frequencies of ULFs. I’ve seen them ULF as often as a few thousand iterations, or as rare as every ~100 million (which is the case for one I’m currently trying to track)

    Sometimes hardware matters: Right now, I’ve got a kernel in HOOMD that runs fine on compute 1.0 and 1.3 devices, but will randomly ULF on compute 1.1 cards

    The CUDA version doesn’t matter. I’ve seen this issue in 0.8, 0.9, 1.0, 1.1, 2.0, and 2.1.

    Similarly, the driver version doesn’t matter.

    The host architecture and OS also doesn’t matter (get the same behavior on linux32, linux64, mac osx, windows xp and vista64)

    I’ve seen this behavior in kernels that use shared memory and some that don’t

    Judging by the lack of anyone else posting this issue to the forums (except for now), this problem is extremely rare.

I’ve attached my reproduction case. It is a lot more complicated than yours, but it is the simplest form I could whittle the actual production code down to while retaining the ULF behavior. Two workarounds are also included in the file: both of which change the branching pattern slightly. I have another kernel in HOOMD (no simple reproduction case for this one yet) that also seems related to the branching pattern. What is fascinating about your kernel with this behavior is that it has no branches.

Oh, my reproduction case has been on file with NVIDIA as a bug for, oh about a year and a half ago. They confirm that the issue is there, but won’t say anything other than “we are working on it”. The last update was 2/8/2008. Maybe my case is still to complicated to find the root cause.

This new case is super-simple, though. Please, please anyone at NVIDIA, don’t just leave this bug report by the wayside writing it off to “just another buggy kernel with out of bounds memory accesses”. There is a real and persistent problem here that has causing stability issues in HOOMD since initial development. Maybe if you can find the root cause to this simple case, it will be the same root cause as I’m seeing. Or if not, maybe related?

Anyways, I’m not optimistic. This problem has been around for 2 years and it is likely to stay and the resultant stress from finding workarounds will result in much lost hair :) Judging from all the places I’ve seen this show up where CUDA version, software, driver, OS don’t matter the only thing left is the actual GPU silicon. So I’m pessimistic and am calling it a very annoying hardware bug. I’m also pessimistic that anyone at NVIDIA will take notice… But then, I’m just a pessimist in general.
timeout_test3.cu.txt (7.58 KB)

Does the generated ptx change a lot in this case? (just curious)

I don’t think you are pessimistic from the short time I spoke to you, more realistic. You would guess NVIDIA would have been able to figure out what goes wrong in the meantime if enough priority was given (and remember that one or two generations of chips will probably have been designed in that time-span, all of which will probably still have the same bug).

This is a good question. Some very subtle bug triggered by a particular instruction sequence could be the issue here. An even smaller PTX test case might help illuminate the problem.

Both of these pass on my C1060, but my G84 (display card) has problems.

Will track this more later…

Oh yes. Very good question. I did look at this once, but I seem to recall the only difference being an order of initialization. I guess I should maybe dig into it again, there may be something promising. Maybe I’ll have time tomorrow.

The thing is, at least in my repro case, the same failing code will work if I give it a dataset with uniform n (see below). At least in my mind, that possibly rules out issues with invalid generated ptx.

This thing takes so long to even test if it is there, it’s hard to know if/when it is fixed. The code I have the most testing on, and have had the most luck with a workaround has:

for (int i = 0; i < n; i++)

   {

   .... do stuff....

   }

Where n varies from thread to thread. This code will crash with the ULF after a few hours of runtime.

When I change this loop to the below, it will run for months without problems.

for (int i = 0; i < max_n; i++)

	{

	if (i < n)

		... do stuff...

	}

where max_n is the maximum value of n across the entire grid.

Really? I was positive that at least my repro case failed on GTX 280.

I just ran it myself on the S1070 (and CUDA 2.1beta) I just got access to and it did run successfully. I’m shocked, as this is the first time I’ve seen that. Will post back in a bit with tests of my case and cellophane_man’s on other GPUs to crosscheck.

timeout_test3:
9800 GX2 / CUDA 2.1 / 180.22: ULFs after 100-1000 iterations
Tesla S1070 / CUDA 2.1beta / 180.06 : Ran 1 million iterations successfully

cellophane_man’s test:
9800 GX2 / CUDA 2.1 / 180.22 : ULFs after ~10k iterations
Tesla S1070 / CUDA 2.1beta / 180.06 : Ran 1 million iterations successfully

My 8800 GTX and Tesla D870 are busy doing real work, or I’d test them too.

I’m going to make the change back to the “broken” code in HOOMD and run a couple dozen 1-day long jobs and see if any of them fail as a more strenuous test of the hypothesis that mayhap the problem doesn’t exist on S1070/C1060.

Running kernel1 from timeout_test3 on FX4800 does also not show any problems.

I am not able to download the zip file with cellophane_man’s test at this time, so I cannot test that one unfortunately

You had me a little worried with your comment, MisterAnderson42, “…but I have never seen a changed block size totally remove the random ULF”. I ran over 10^9 iterations on a few of my “good” sizes with no lockup on the 8800GT, so my work-around seems reasonably robust and I can sleep a little easier. I’m continuing testing. I will feel better when we understand what is actually going on.

Thats great!

And just because I never found a block size to solve the problem in my case doesn’t precluded it from working in yours. I’ve learned to assume nothing with this issue. We may be triggering the same hardware problem, but in very different ways. Since my case is full of branches, mayhap the block size plays a bigger role than in your extremely simple kernel.

Oh, and I need to retract some of my earlier pessimism. Tim (tmurray) jumped right on this and put some time into it yesterday running these cases on a lot of different hardware and digging to the bottom of the issue. So far, it seems that the issue is not present on all hardware C1060 and newer. I’ve confirmed this (at least on S1070’s) with production runs in HOOMD: 1/2 a billion iterations of real simulation over 20 hours each on 16 GPUs without a problem. That is enough to convince me.

I’m happy that the problem is finally solved! And with the only relic being some workaround kernels in HOOMD activated when running on older hardware.

I’m happy he is looking into it, he sure is the person taking issues and going to the bottom of them.

Does anybody have any new information on this topic? I have been developing some more kernels for other purposes
that seem to exhibit this problem and my work-around of using “good” block sizes that worked so
well before is not working on these. Perhaps if the engineers at NVIDIA could explain more about what is going on I could
figure out how to create a more robust work-around rather than the time consuming trial and error method
of permuting the code, running a few million iterations and seeing what happens.
Thanks
cellophane man

Hi,

I find this thread very valuable.

I spent this week fighting with what seems to be a similar issue in my code. I hope my experience might be useful to others, and also I have a couple of questions to ask.

My kernel is fairly complex. The kernel is invoked about half a million times in one run of the executable, each time on several thousand blocks, each block fully occupies a multiprocessor, and runs 256 threads. The kernel uses the shared memory very extensively. The shared memory is “dynamically” requested by the host code, by means of specifying the third parameter to the kernel invocation call inside <<<>>> brackets. The typical kernel launch requires about a hundred milliseconds to complete. The executable takes about 5 hours to run on GTX280. All the interaction with the device (single, at the moment, will have to scale up to more devices later on) occurs from one thread, which is architectured in a way similar to Mr Anderson’s GPUWorker (http://forums.nvidia.com/index.php?showtopic=66598). The host-to-device interaction proceeds in the most simple manner, using cuda* functions only. The host-device interaction code doesn’t use C++ features. The code runs on a dedicated GPU under Linux, i.e. the X server runs on a different GPU. The driver’s version is 180.22, the CUDA’s version is 2.1 running on Ubuntu-8.04, x86_64, dual core.

What I’ve noticed, is that the kernel returns prematurely. I have a lot of sanity checks scattered around the kernel, each of which verifies a condition, and if the condition is violated, the check drops an error message into pre-allocated device RAM and returns from the kernel. One of such checks indicated an error. What seemed very interesting, is that on the failed invocation, all the blocks returned from the same point of the kernel, having detected a violation of the same condition. Overall, the code behaved very nicely: it was NOT an ULF, but simply a normal (from CUDA’s perspective) return from the kernel. So, I added a RAM-based array of debug structures (one per block) to my device code and had the kernel store some data into this structure, in the same exact way, for every block, on every kernel invocation. My host code would retrieve the structures from the device RAM and print them out in case of a premature return. Once I’ve added the kernel code to populate the debug structures, my kernel started generating the ULF, instead of exhibiting the premature return. Again, the ULF is generated only once in about a million of kernel invocations.

Needless to say, every attempt to reproduce the issues (either the premature return, or the ULF) from a dedicated stand-alone executable, in which the kernel was invoked on the exact same parameters, with the exact same set of blocks, as those, which have caused the issues, have failed.

Neither the ULF, nor a premature return crashes the driver of the card. The workstation doesn’t need to be rebooted after the ULF, and an immediate rerun of the executable can be successful.

My point is that the ULF might have another face to it, and it might be some kind of a memory/cache corruption. Interestingly, the check, that causes the premature return, compares a bunch of automatic variables (presumably, register-based) to a variable in the constant memory. This constant memory variable is referenced very frequently in the code, so it probably ends up being cached. My current theory is that this comparison returns a wrong result and this causes the premature return. To support this theory, I can add, that this check performs the exact same comparison, of the exact same numbers, for every block it ever runs in (it verifies, that a temp array fits into the shared memory).

Here’s another observation: sometimes (haven’t check every time), when I observe the ULF, I can see the following in my logs:

/var/log/messages

Apr 12 10:30:50 pacific kernel: [216357.389429] Machine check events logged

dmesg:

[177528.459236] NVRM: Xid (0001:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100

[177528.461758] NVRM: Xid (0001:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100

[216230.473116] NVRM: Xid (0001:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100

[216357.389429] Machine check events logged

There’re a few more occurrences of this line in my dmesg (ALL the parameters after NVRM are the same since the last reboot). I don’t know what 0001:00 is, but my CUDA card is in the PCI-X:1 slot, and my X-server card is in the PCI-X:2 slot. I think I can explain why this line shows up thrice in the log: the code attempted to re-invoke the kernel three times, with no clean-up between the invocations, and all the three have failed (see my question below).

Even though the issue is a major annoyance (I have already spent a week tracking it down, and failed), given it’s very rare, I could perhaps live with it. But for this I need to be able to re-launch the kernel several times on an ULF, or another failure. CUDA manual and examples mention many times, that CUDA error has to be checked, using cudaGetLastError(), but I failed to find an example of how the error can be cleaned up, or handled in any other way, than “exit with a descriptive error message”. Specifically, I would like to do smth like:

NUM_ATTEMPTS = 3;

for i = 1 : NUM_ATTEMPTS {

   move my data into device memory; launch my kernel;

   if no error (premature return, cudaGetLastError, etc)

	  break;

   clean-up;

   reinitialize;

}

What I failed to find in the manuals, is how I can clean-up the device after a CUDA error. I would strongly prefer this to returning from the device management thread and starting another thread (in order to drop the context), but if this is the only solution, I guess I’ll have to live with it.

I agree with cellophane man: in case NVIDIA has a clue of what’s going on, maybe they can educate us how to work around this issue: maybe a frequent overwrite of constant memory is bad (i.e. once the constant memory is written by the host code, do not change it unless the contents absolutely needs to change), or cudaMalloc should be used as infrequently, as possible, but the chunks of memory should be reused, etc.

Thanks in advance for your advice!

cudesnick,

Thanks for reviving this card,

btw,
Is your CUDA card compute 1.1? Are you able to reproduce this problem with TESLA or any other compute 1.3 device or compute 1.0 device?

The only card I’m playing with at the moment is GTX280, which is a compute 1.3 device.

@cudesnick,

For the benefit of the community and NVIDIA people, Can you post the most minimal kernel that reproduces the issue (Even if it is after 1 million times, it is ok). This way, NVIDIA people can take this issue up and see if there is a loophole somewhere.

Thanks,
Best Regards,
Sarnath

Sarnath,

I would definitely love to, but I am afraid I cannot at the moment. There are two reasons:

  1. NDA, under which the code is being developed.

  2. It would take me a huge amount of work to significantly reduce the kernel. Note, that the bug shows up after running for many hours, so it would take me over a week to go through several kernel changes.

However, I’m working on trying to resolve the issue somehow. I’ll be posting onto this thread whatever I can find out.