702 launch timeout, 716 misaligned address and 999 unknown

Can you explain these three errors?

All three errors occur in the same loop.

First, a 702 launch timeout error occurs.

After running the code several more times, the error code changes from 702 to 999.

Then a 716 misaligned address error starts to occur.

First, a 702 launch timeout error occurs.

Are you on Windows? If so, the WDDM driver’s watchdog timer has killed your kernel. Your options are to increase the watchdog timeout, which isn’t recommended since it involves editing the registry, or to use a Tesla GPU with the TCC driver.

After running the code several more times, the error code changes from 702 to 999.

999 is error “unknown”.

Then a 716 misaligned address error starts to occur.

These all point to some type of uninitialized memory or other memory problem.

Try running your binary under the “cuda-memcheck” utility to see if anything useful is detected. It doesn’t always help, but might give some clues.

  • Mat

Yes, I’m using Windows.

Is it possible for GPU memory to be corrupted when the watchdog kills my kernel?

I ask because in the first several tries error 716 doesn’t occur; error 702 occurs instead.

Is it possible for GPU memory to be corrupted when the watchdog kills my kernel?

Maybe, but I haven’t seen it before.

I ask because in the first several tries error 716 doesn’t occur; error 702 occurs instead.

While I have no idea what’s wrong, given that the error is non-deterministic, it does have the feel of an uninitialized memory read (UMR).

Did using cuda-memcheck help?

  • Mat

No, it doesn’t report anything other than the misaligned address error.

My theory is:

Since I have many arrays of derived types, padding bytes have to be inserted between elements to maintain alignment.

The watchdog doesn’t fully reinitialize GPU memory when it kills the kernel, so after the kernel has been killed several times, GPU memory gets dirty(?), which makes it hard for the derived type elements to fit in.

It’s just my guess.

Is this situation ever possible?

Is this situation ever possible?

I’m not sure. I typically only work with Tesla GPUs and the TCC driver, since that’s what we officially support. WDDM and GeForce cards are designed for games, not scientific computing.

Do you have access to a Linux system, or a Windows system with a Tesla card and the TCC driver? Running there might tell you whether your guess is correct.

  • Mat