CUDA_ERROR_LAUNCH_FAILED preceded by GPU FIFO (DMA) failures

Hello,

I’m attempting to debug an intermittent issue that can take hours to days to reproduce. Essentially, I am running an application that continuously runs a TensorFlow model. After hours, or even days, I then get messages like the following in the kernel log:

[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_fifo_handle_pbdma_intr: pbdma_intr_0(0):0x00040000 PBH: 20400000 SHADOW: 00000001 M0: 00000000 00000000 00000000 00000000
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 506
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 507
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 505
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 504
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_fifo_set_ctx_mmu_error_tsg: TSG 0 generated a mmu fault
[Tue Feb 11 14:07:01 2020] ---- mlocks ----

followed by CUDA_ERROR_LAUNCH_FAILED in the application. Any ideas on where to start with this, or any additional tools I could use?
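
For context, this is roughly where the error surfaces on our side. Below is a minimal sketch (not our actual code) of how we drive the model through the TensorFlow C API; the graph loading and tensor setup are omitted, and the helper name run_once is just for illustration. The CUDA error text comes back in the TF_Status from TF_SessionRun:

// Sketch: TensorFlow's C API reports failed kernel launches through the
// TF_Status passed to TF_SessionRun; the status message carries the CUDA
// error string (e.g. CUDA_ERROR_LAUNCH_FAILED after the FIFO/MMU fault).
#include <tensorflow/c/c_api.h>
#include <cstdio>

// Runs one inference step; returns false if the session reported an error.
bool run_once(TF_Session* session,
              const TF_Output* inputs, TF_Tensor* const* input_values, int ninputs,
              const TF_Output* outputs, TF_Tensor** output_values, int noutputs) {
  TF_Status* status = TF_NewStatus();
  TF_SessionRun(session,
                /*run_options=*/nullptr,
                inputs, input_values, ninputs,
                outputs, output_values, noutputs,
                /*target_opers=*/nullptr, /*ntargets=*/0,
                /*run_metadata=*/nullptr, status);
  const bool ok = (TF_GetCode(status) == TF_OK);
  if (!ok) {
    // This is where "CUDA_ERROR_LAUNCH_FAILED" shows up when the fault occurs.
    std::fprintf(stderr, "TF_SessionRun failed: %s\n", TF_Message(status));
  }
  TF_DeleteStatus(status);
  return ok;
}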

Hi,

We need more logs for this issue.
Would you mind helping us collect them?

Please run the application with GPU trace enabled.

$ NVRM_GPU_TRACE=1 [your app]

Ex.

$ NVRM_GPU_TRACE=1 python3 test.py

Thanks.

Thanks, AastaLLL. Should I be seeing any additional prints right away, or only when the issue reproduces? So far I do not see any extra information when setting NVRM_GPU_TRACE=1.

@AastaLLL - I still cannot get NVRM_GPU_TRACE to do anything. Any ideas?
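
For what it’s worth, here is a minimal check (a sketch, not something from our production code) that could confirm the variable actually reaches the application’s process environment; I’m assuming it might be dropped if the app is launched from a service rather than an interactive shell:

#include <cstdio>
#include <cstdlib>

int main() {
  // NVRM_GPU_TRACE is set as an environment variable, so it must be present
  // in the environment of the application process itself (e.g. exported in
  // the service file or wrapper script), not just in an interactive shell.
  const char* trace = std::getenv("NVRM_GPU_TRACE");
  std::printf("NVRM_GPU_TRACE=%s\n", trace ? trace : "(unset)");
  return 0;
}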

Hi,

We would like to reproduce this issue in our environment.
Would you mind sharing a minimal reproducible source with us?

Thanks.

Hi,

Sorry for keeping you waiting.

May I know which software version you use?
There is a known issue in r32.2.1, and it’s recommended to upgrade to r32.2.3.

Thanks.

Thanks, AastaLLL. Unfortunately, I don’t have a minimal reproducible example of this error. It is highly intermittent, taking anywhere from 12 hours to a week to reproduce while running our application, which pegs GPU utilization at 99%.

We are still stuck on an older JetPack release, specifically L4T 28.2.1.

You will almost certainly recommend updating to a modern BSP; we are working towards that as fast as possible. In the meantime, I’d be very grateful for any details on past known issues that may explain this error, and for any potential workarounds or debugging methods.

Thanks!

Hi,

We will need more information to give further suggestions.
Would you mind describing your use case in more detail?

For example:
What is the pipeline? Ex. camera -> CUDA -> TensorFlow?
Do you use any decoder or encoder? If yes, may I know whether it is H264 or H265?
Do you think this issue is related to multi-threading?

Thanks.

Certainly.

Our application is a C++ application that uses GStreamer (with nvdec/omx) to decode a 4K RTP video stream. It then extracts ROIs from the decoded video frames and performs object detection via the TensorFlow C API (with the TensorFlow library built to use CUDA/GPU). In addition, it runs our own tracking and application logic, as well as some OpenCV-based operations.
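
To give a concrete picture of the front end, here is a minimal sketch of the decode path (not our production pipeline; it assumes an H.264 stream over RTP/UDP, and the element names, caps, and port are illustrative placeholders):

// Minimal sketch of the decode front end (assumes H.264 over RTP/UDP; the
// pipeline string below is illustrative, not our exact production pipeline).
#include <gst/gst.h>
#include <gst/app/gstappsink.h>

int main(int argc, char** argv) {
  gst_init(&argc, &argv);

  GError* error = nullptr;
  GstElement* pipeline = gst_parse_launch(
      "udpsrc port=5000 caps=\"application/x-rtp,media=video,clock-rate=90000,encoding-name=H264\" ! "
      "rtph264depay ! h264parse ! omxh264dec ! "
      "nvvidconv ! video/x-raw,format=BGRx ! "
      "appsink name=sink sync=false max-buffers=2 drop=true",
      &error);
  if (!pipeline) {
    g_printerr("Failed to build pipeline: %s\n", error->message);
    g_clear_error(&error);
    return 1;
  }

  GstElement* sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");
  gst_element_set_state(pipeline, GST_STATE_PLAYING);

  // Pull decoded frames; returns NULL on EOS or when the pipeline stops.
  while (GstSample* sample = gst_app_sink_pull_sample(GST_APP_SINK(sink))) {
    GstBuffer* buffer = gst_sample_get_buffer(sample);
    GstMapInfo map;
    if (gst_buffer_map(buffer, &map, GST_MAP_READ)) {
      // map.data / map.size hold the decoded BGRx frame; this is where the
      // ROIs would be cropped and handed to the TensorFlow model.
      gst_buffer_unmap(buffer, &map);
    }
    gst_sample_unref(sample);
  }

  gst_element_set_state(pipeline, GST_STATE_NULL);
  gst_object_unref(sink);
  gst_object_unref(pipeline);
  return 0;
}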

So there is a lot happening within one process (GStreamer, OMX, OpenCV, TensorFlow, CUDA, our code). However, we run this same source on other (non-TX2) platforms without issue, and we are quite thorough in using Valgrind and other tools to avoid memory leaks and corruption.

Does this help? Anything else I can provide?

Hi,

Sorry for the late update.
I will check this issue with our internal team and share more information with you later.

Thanks.

Hi,

Thanks for your patience.

Actually, this is a known issue that we have been investigating for several weeks.
Since the error is hard to reproduce, progress has not been ideal.

If upgrading the software is an option for you, we recommend doing so.
We will keep working on this issue and will update you on any progress.

Thanks.

Thanks for the update, AastaLLL. We are continuing to work towards a BSP update, but that is still a fairly long way off. I’ll continue to monitor this thread, so please do let us know if/when you find something.

Thanks!

@AastaLLL Any update on this issue? It is still causing problems for me.

Hi,

Thanks for your patience.

This issue is still being debugged.
We will share more information once we make progress.

Hi,
I have the same issue. Do you have any updates on it?

Hi,

Please upgrade the OS to r32 to avoid this issue.
Thanks.