CUDA_ERROR_LAUNCH_FAILED preceded by GPU FIFO (DMA) failures

Hello,

I’m attempting to debug an intermittent issue that can take hours to days to reproduce. Essentially, I am running an application that continuously runs a TensorFlow model. After hours, or even days, I then get messages like the following in the kernel log:

[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_fifo_handle_pbdma_intr: pbdma_intr_0(0):0x00040000 PBH: 20400000 SHADOW: 00000001 M0: 00000000 00000000 00000000 00000000
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 506
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 507
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 505
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 504
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_fifo_set_ctx_mmu_error_tsg: TSG 0 generated a mmu fault
[Tue Feb 11 14:07:01 2020] ---- mlocks ----

followed by CUDA_ERROR_LAUNCH_FAILED in the application. Any ideas on where to start with this, or any additional tools I could use?
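
For context, this is roughly where the error surfaces on our side. Below is a minimal sketch (not our actual code) of how we drive the model through the TensorFlow C API; the graph loading and tensor setup are omitted, and the helper name run_once is just for illustration. The CUDA error text comes back in the TF_Status from TF_SessionRun:

// Sketch: TensorFlow's C API reports failed kernel launches through the
// TF_Status passed to TF_SessionRun; the status message carries the CUDA
// error string (e.g. CUDA_ERROR_LAUNCH_FAILED after the FIFO/MMU fault).
#include <tensorflow/c/c_api.h>
#include <cstdio>

// Runs one inference step; returns false if the session reported an error.
bool run_once(TF_Session* session,
              const TF_Output* inputs, TF_Tensor* const* input_values, int ninputs,
              const TF_Output* outputs, TF_Tensor** output_values, int noutputs) {
  TF_Status* status = TF_NewStatus();
  TF_SessionRun(session,
                /*run_options=*/nullptr,
                inputs, input_values, ninputs,
                outputs, output_values, noutputs,
                /*target_opers=*/nullptr, /*ntargets=*/0,
                /*run_metadata=*/nullptr, status);
  const bool ok = (TF_GetCode(status) == TF_OK);
  if (!ok) {
    // This is where "CUDA_ERROR_LAUNCH_FAILED" shows up when the fault occurs.
    std::fprintf(stderr, "TF_SessionRun failed: %s\n", TF_Message(status));
  }
  TF_DeleteStatus(status);
  return ok;
}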

Hi,

We need more logs for this issue.
Would you mind helping us collect them?

Please run the application with GPU trace enabled.

$ NVRM_GPU_TRACE=1 [your app]

Ex.

$ NVRM_GPU_TRACE=1 python3 test.py

Thanks.

Thanks, AastaLLL. Should I be seeing any additional prints right away, or only when the issue reproduces? So far I do not see any extra information when setting NVRM_GPU_TRACE=1.

@AastaLLL - I still cannot get NVRM_GPU_TRACE to do anything. Any ideas?
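
For what it’s worth, here is a minimal check (a sketch, not something from our production code) that could confirm the variable actually reaches the application’s process environment; I’m assuming it might be dropped if the app is launched from a service rather than an interactive shell:

#include <cstdio>
#include <cstdlib>

int main() {
  // NVRM_GPU_TRACE is set as an environment variable, so it must be present
  // in the environment of the application process itself (e.g. exported in
  // the service file or wrapper script), not just in an interactive shell.
  const char* trace = std::getenv("NVRM_GPU_TRACE");
  std::printf("NVRM_GPU_TRACE=%s\n", trace ? trace : "(unset)");
  return 0;
}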

Hi,

We would like to reproduce this issue in our environment.
Would you mind sharing a minimal reproducible source with us?

Thanks.

Hi,

Sorry for keeping you waiting.

May I know which software version you use?
There is a known issue in r32.2.1, and it’s recommended to upgrade to r32.2.3.

Thanks.

Thanks, AastaLLL. Unfortunately, I don’t have a minimal reproducible example of this error. It is highly intermittent, taking anywhere from 12 hours to a week to reproduce while running our application, which pegs GPU utilization at 99%.

We are still stuck on an older JetPack release, specifically L4T 28.2.1.

You will almost certainly recommend updating to a modern BSP; we are working towards that as fast as possible. In the meantime, I’d be very grateful for any details on past known issues that may explain this error, and for any potential workarounds or debugging methods.

Thanks!

Hi,

We will need more information to give further suggestions.
Would you mind describing your use case in more detail?

For example:
What is the pipeline? Ex. camera -> CUDA -> TensorFlow?
Do you use any decoder or encoder? If yes, may I know whether it is H264 or H265?
Do you think this issue is related to multi-threading?

Thanks.

Certainly.

Our application is a C++ application that uses GStreamer (with nvdec/omx) to decode a 4K RTP video stream. It then extracts ROIs from the decoded video frames and performs object detection via the TensorFlow C API (with the TensorFlow library built to use CUDA/GPU). In addition, it runs our own tracking and application logic, as well as some OpenCV-based operations.
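
To give a concrete picture of the front end, here is a minimal sketch of the decode path (not our production pipeline; it assumes an H.264 stream over RTP/UDP, and the element names, caps, and port are illustrative placeholders):

// Minimal sketch of the decode front end (assumes H.264 over RTP/UDP; the
// pipeline string below is illustrative, not our exact production pipeline).
#include <gst/gst.h>
#include <gst/app/gstappsink.h>

int main(int argc, char** argv) {
  gst_init(&argc, &argv);

  GError* error = nullptr;
  GstElement* pipeline = gst_parse_launch(
      "udpsrc port=5000 caps=\"application/x-rtp,media=video,clock-rate=90000,encoding-name=H264\" ! "
      "rtph264depay ! h264parse ! omxh264dec ! "
      "nvvidconv ! video/x-raw,format=BGRx ! "
      "appsink name=sink sync=false max-buffers=2 drop=true",
      &error);
  if (!pipeline) {
    g_printerr("Failed to build pipeline: %s\n", error->message);
    g_clear_error(&error);
    return 1;
  }

  GstElement* sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");
  gst_element_set_state(pipeline, GST_STATE_PLAYING);

  // Pull decoded frames; returns NULL on EOS or when the pipeline stops.
  while (GstSample* sample = gst_app_sink_pull_sample(GST_APP_SINK(sink))) {
    GstBuffer* buffer = gst_sample_get_buffer(sample);
    GstMapInfo map;
    if (gst_buffer_map(buffer, &map, GST_MAP_READ)) {
      // map.data / map.size hold the decoded BGRx frame; this is where the
      // ROIs would be cropped and handed to the TensorFlow model.
      gst_buffer_unmap(buffer, &map);
    }
    gst_sample_unref(sample);
  }

  gst_element_set_state(pipeline, GST_STATE_NULL);
  gst_object_unref(sink);
  gst_object_unref(pipeline);
  return 0;
}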

So there is a lot happening within one process (GStreamer, OMX, OpenCV, TensorFlow, CUDA, our code). However, we run this same source on other (non-TX2) platforms without issue, and we are quite thorough in using Valgrind and other tools to avoid memory leaks and corruption.

Does this help? Anything else I can provide?

Hi,

Sorry for the late update.
I will check this issue with our internal team and share more information with you later.

Thanks.

Hi,

Thanks for your patience.

Actually, this is a known issue that we have been investigating for several weeks.
Since the error is hard to reproduce, progress has not been ideal.

If upgrading the software is an option for you, we recommend doing so.
We will keep working on this issue and will update you on any progress.

Thanks.

Thanks for the update, AastaLLL. We are continuing to work towards a BSP update, but that is still a fairly long way off. I’ll continue to monitor this thread, so please do let us know if/when you find something.

Thanks!

@AastaLLL Any update on this issue? It is still causing problems for me.

Hi,

Thanks for your patience.

This issue is still being debugged.
We will share more information once we make progress.

Hi,
I have the same issue. Do you have any updates on it?

Hi,

Please upgrade the OS to r32 to avoid this issue.
Thanks.