Random driver crashes with new drivers and old Maxwell Quadro GPUs

Hi!

We are facing a random, difficult-to-debug issue with our commercial software.

We currently support a set of specific server configurations, with Maxwell-, Pascal-, and Turing-based Quadro 4000 GPUs (that is, M4000, P4000 and RTX4000).

We have a software version, let’s say version A, that was certified to work with driver 419.67. It works fine in all the server configurations.

We have another software version, version B, that is being certified to work with driver 442.92, and in fact requires this driver version or higher, because we are using a new feature in the NVENC API for better performance.

The issue is that version B randomly crashes, but only on servers with two Quadro M4000 GPUs.

Furthermore, we re-verified that version A does not crash on those servers with driver 419.67, but it does crash randomly once we install driver 442.92.

We reproduced the issue with 452.06 too.

Is there any known bug in the newer drivers with Maxwell GPUs?

The rest of the system configuration:

Windows 10 Enterprise 2015 LTSB
Intel Xeon CPU E5-1620 v3 @ 3.50 GHz
16 GB of DDR4 2133 MHz
Two Quadro M4000 GPUs, each connected to a full x16 PCIe 3.0 slot
Server board and enclosure -> Dell Precision Tower 5810

The configurations that work include Dell Precision Tower 5820 with dual P4000 GPUs, and Supermicro servers with dual and triple Quadro RTX 4000 on Intel and AMD EPYC motherboards, with Windows 10 Enterprise 2016 LTSB.

Thanks!

Hi,
Is there any log or more information about the crash?

Thanks!

Hi. I work with the author of this post. Indeed, we tried our product with drivers 442.92 and 452.06 on a server with 2 M4000 GPUs, and it crashes at random, after anywhere between 1 and 12 hours.
We tried the same software with 419.67 and it remains stable (we tested this for 3 days).
We have also tried on servers with 2 RTX4000 and everything works fine. Also, we could not reproduce this crash with Pascal.
So it seems there MUST be a bug with newer drivers when running on 2 M4000.
The crash occurs at random after running just 3 decoding feeds (Full HD @ 25 fps, decoded through the ffmpeg library) for a long time while doing some deep learning inference on the other GPU. It also happened without the DL inference, running only a background subtraction kernel from OpenCV, so the DL inference does not seem to be the cause.
The time to crash is quite variable: sometimes 15 minutes, sometimes 10 hours. Typically it takes between 1 and 2 hours, but we have observed the other cases as well. We also see this crash on 4 different servers, all with two M4000 GPUs.

From the logs we have the following:
25-09-2020 22:44:52-432 ERROR [8048] - ffmpeg_t::avlog_cb[154] (0 times) decoder->cvdl->cuvidMapVideoFrame(decoder->decoder, cf->idx, &devptr, &pitch, &vpp) failed
25-09-2020 22:44:52-458 ERROR [5720] - PAFFMPEGCam::decodeFrameOnGpu[1707] (0 times) Error decoding video packet (error ‘Generic error in an external library’)
and finally
25-09-2020 22:44:52-654 FATAL [816] - paCheck[26] Throwing exception because of CUDA error: ‘cudaErrorLaunchFailure’ executing ‘cudaStreamSynchronize(m_stream)’ in ‘streamSynchronize’ function at …
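
To see how closely these three failures follow one another, the timestamps can be parsed and compared. Below is a minimal Python sketch, assuming the timestamp layout is day-month-year followed by hour:minute:second-milliseconds (inferred from the lines above, not from any documented format). `parse_log_line` is a hypothetical helper, and the messages are shortened for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "DD-MM-YYYY HH:MM:SS-mmm LEVEL [thread id] - message"
LINE_RE = re.compile(
    r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})-(\d{3}) (\w+) \[(\d+)\] - (.*)$"
)

def parse_log_line(line):
    """Return (timestamp, level, thread_id, message), or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, millis, level, thread_id, message = m.groups()
    ts = datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S")
    ts = ts.replace(microsecond=int(millis) * 1000)
    return ts, level, int(thread_id), message

# Messages shortened for illustration; timestamps and thread ids as logged
lines = [
    "25-09-2020 22:44:52-432 ERROR [8048] - ffmpeg_t::avlog_cb[154] cuvidMapVideoFrame failed",
    "25-09-2020 22:44:52-458 ERROR [5720] - PAFFMPEGCam::decodeFrameOnGpu[1707] Error decoding video packet",
    "25-09-2020 22:44:52-654 FATAL [816] - paCheck[26] cudaErrorLaunchFailure",
]
events = [parse_log_line(line) for line in lines]
first = events[0][0]
for ts, level, thread_id, message in events:
    # Integer division by a timedelta avoids floating-point rounding
    delta_ms = (ts - first) // timedelta(milliseconds=1)
    print(f"+{delta_ms:4d} ms {level:5s} thread {thread_id}: {message}")
```

In this excerpt all three messages fall within 222 ms of each other and come from three different threads, which would be consistent with one underlying fault being reported through several code paths rather than three independent failures.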

Thanks

Hi, we have more information on the issue. We are narrowing down the driver version where this random crash starts to happen.

We found that with the 432.28 driver it does not happen (same as with 419.67).

We are now testing all the Quadro drivers in the R440 branch before 442.92, which is the version we were originally using and the lowest version with which we have managed to reproduce the issue.

Just a reminder, we could only reproduce it with dual GPU Quadro M4000 systems.

Latest findings:

The first R440 driver crashes (and so do the rest).

The last R430 driver does not crash (and neither do several earlier ones).

So it seems clear to us that the issue was introduced in R440.
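
The narrowing-down described above amounts to a bisection over an ordered list of driver versions. Here is a minimal Python sketch, assuming that once a driver version crashes, every later one does too. `first_bad_driver` and `crashes` are hypothetical names, and the `crashes` stub only encodes the results for the four versions named in this thread; in practice each check would be a long soak test on the dual-M4000 hardware.

```python
def version_key(version):
    """Turn '442.92' into (442, 92) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))

def first_bad_driver(versions, crashes):
    """Binary search for the earliest crashing version.

    Assumes monotonic behaviour: once a version crashes, all later ones do.
    Returns None if no version crashes.
    """
    ordered = sorted(versions, key=version_key)
    lo, hi = 0, len(ordered)
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(ordered[mid]):
            hi = mid        # the first bad version is at mid or earlier
        else:
            lo = mid + 1    # the first bad version is after mid
    return ordered[lo] if lo < len(ordered) else None

# Results reported in this thread for the dual-M4000 servers
REPORTED = {"419.67": False, "432.28": False, "442.92": True, "452.06": True}

def crashes(version):
    # Hypothetical stand-in for a real soak test on the target hardware
    return REPORTED[version]

print(first_bad_driver(REPORTED.keys(), crashes))  # prints 442.92
```

Because the crash is intermittent, a "no crash" verdict for a given driver is only as strong as the soak time behind it, so each good result needs a run well beyond the longest observed time to crash.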

We understand that R440 is where Video Codec SDK 9.1 was introduced, right? Maybe the issue is there?

Next we will try to find out whether the crash always happens in the same part of the code or in random parts of it.

The only solution we have right now for our M4000-based servers is to roll back the driver to the latest version in R430.

Please let us know if you are looking into this, or if you need more information from us. Ask for as much as you need and we will try to provide it.

Thanks!

Hi, another update,

We have tested all possible combinations of deactivating parts of the code to isolate the cause, and everything crashes as long as there is enough GPU load, but only with R440 and R450 drivers. Again, R410 and R430 do not crash.

We will make a toy example that reproduces the issue. Is this the right place to file a driver bug?

Thanks!

Hi oamoros0ealf and martus1y5kj,

If you have a solid reproducer for a driver issue, you should be able to file a bug report, with all the details required to reproduce it, directly through the “Customer Feedback” links at the bottom of this site: http://www.nvidia.com/page/support.html

We managed to build the reproducer and filed a bug: 3221330