I’m attempting to debug an intermittent issue that can take hours or even days to reproduce. Essentially, I am running an application that continuously runs a TensorFlow model. After hours, or sometimes days, I get prints like the following in the kernel log:
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_fifo_handle_pbdma_intr: pbdma_intr_0(0):0x00040000 PBH: 20400000 SHADOW: 00000001 M0: 00000000 00000000 00000000 00000000
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 506
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 507
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 505
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_set_error_notifier_locked: error notifier set to 32 for ch 504
[Tue Feb 11 14:07:01 2020] gk20a 17000000.gp10b: gk20a_fifo_set_ctx_mmu_error_tsg: TSG 0 generated a mmu fault
[Tue Feb 11 14:07:01 2020] ---- mlocks ----
This is followed by CUDA_ERROR_LAUNCH_FAILED in the application. Any ideas on where to start with this, or any additional tools I could use?
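In case it helps, here is roughly where the failure shows up on our side. This is only a simplified sketch of the status check around TF_SessionRun; the session/tensor setup is omitted and run_inference is an illustrative name, not our real code.

#include <tensorflow/c/c_api.h>
#include <cstdio>

// Simplified sketch: after the kernel prints above, the next inference call
// returns a non-OK status whose message mentions CUDA_ERROR_LAUNCH_FAILED.
void run_inference(TF_Session* session,
                   const TF_Output* inputs, TF_Tensor* const* input_values, int ninputs,
                   const TF_Output* outputs, TF_Tensor** output_values, int noutputs)
{
    TF_Status* status = TF_NewStatus();
    TF_SessionRun(session,
                  nullptr,                          // run options
                  inputs, input_values, ninputs,    // feeds
                  outputs, output_values, noutputs, // fetches
                  nullptr, 0,                       // target operations
                  nullptr,                          // run metadata
                  status);
    if (TF_GetCode(status) != TF_OK) {
        // This is where we see the CUDA_ERROR_LAUNCH_FAILED message.
        std::fprintf(stderr, "TF_SessionRun failed: %s\n", TF_Message(status));
    }
    TF_DeleteStatus(status);
}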
Thanks AastaLLL. Should I be seeing any additional prints immediately, or only once the issue reproduces? So far I do not see any extra information after setting NVRM_GPU_TRACE=1.
Thanks AastaLLL. Unfortunately, I don’t have a minimal reproducible example of this error. It is highly intermittent, taking anywhere from 12 hours to a week to reproduce while running our application, which pegs GPU utilization at 99%.
We are still stuck on an older JetPack release, specifically L4T 28.2.1.
You will almost certainly recommend updating to a modern BSP, and we are working towards that as fast as possible. In the meantime, if you have any details on past known issues that may explain this error, or any potential workarounds or debugging methods, I’d be very grateful.
We will need more information to give further suggestions.
Would you mind describing your use case in more detail?
For example:
What is the pipeline? E.g. camera → CUDA → TensorFlow?
Do you use any decoder or encoder? If yes, is it H.264 or H.265?
Do you think this issue is related to multi-threading?
Our application is a C++ application that uses GStreamer (with nvdec/omx) to decode a 4K RTP video stream. It then extracts ROIs from each decoded frame and performs object detection via the TensorFlow C API (with the TensorFlow library built to use CUDA/GPU). In addition, it runs our own tracking and application logic, as well as some OpenCV-based operations.
So there is a lot happening within one process (GStreamer, OMX, OpenCV, TensorFlow, CUDA, our code). However, we run this same source on other (non-TX2) platforms without issue, and we are quite thorough about using Valgrind and other tools to avoid memory leaks and corruption.
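For reference, the per-frame flow looks roughly like this. It is only a sketch: grab_decoded_frame, detect_rois, run_detector and update_tracks are illustrative stand-ins for our real code, and the GStreamer/appsink plumbing that feeds the loop is omitted.

#include <opencv2/opencv.hpp>
#include <tensorflow/c/c_api.h>
#include <vector>

// Stand-ins for our real code (declarations only, for illustration).
cv::Mat grab_decoded_frame();                      // next decoded 4K frame from the nvdec/omx pipeline
std::vector<cv::Rect> detect_rois(const cv::Mat&); // OpenCV-based ROI extraction
void run_detector(TF_Session*, const cv::Mat&);    // wraps TF_SessionRun, as in the earlier sketch
void update_tracks();                              // our tracking / application logic

void process_stream(TF_Session* session)
{
    for (;;) {
        cv::Mat frame = grab_decoded_frame();
        for (const cv::Rect& roi : detect_rois(frame)) {
            cv::Mat crop = frame(roi).clone();     // copy the ROI out of the full frame
            run_detector(session, crop);           // object detection on the GPU
        }
        update_tracks();
    }
}

All of this runs in a single process alongside the GStreamer decode threads.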
Actually, this is a known issue that we have been investigating for several weeks.
Since this error is hard to reproduce, progress has not been ideal.
So if upgrading the software is an option for you, we recommend doing so.
But we will keep working on this issue and will update you on any progress.
Thanks for the update, AastaLLL. We are continuing to work towards a BSP update, but that is still a fairly long way off. I’ll continue to monitor this thread, so please do let us know if/when you find something.