CUDA Error Generated for Computer Vision Projects running on single Jetson Xavier AGX GPU unit

We are using the Jetson Xavier AGX GPU to run our application on 3 separate docker containers. The 3 containers share the same single GPU.

Recently, our application started outputting CUDA errors, but the string containing the error never seems to make it into the log.

While investigating one occurrence, we noticed an NVIDIA error in the syslog at exactly the same time as our application’s error:

Line 40561: Nov 11 13:55:12 fs-19362 kernel: [13390120.907710] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507

Line 40562: Nov 11 13:55:12 fs-19362 kernel: [13390120.911724] nvgpu: 17000000.gv11b gv11b_fifo_handle_ctxsw_timeout:1611 [ERR] ctxsw timeout error: active engine id =0, tsg=8, info: awaiting ack ms=3100
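For anyone who wants to scan a whole syslog for entries like the two above, here is a minimal Python sketch. The regular expression is only inferred from the two lines quoted in this post, so other nvgpu messages may be formatted differently, and `find_nvgpu_errors` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Pattern inferred from the two syslog lines quoted above (an assumption):
# "<Mon> <day> <HH:MM:SS> <host> kernel: [<uptime>] nvgpu: <device> <func>:<line> [ERR] <message>"
NVGPU_ERR = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>[\d.]+)\]\s+nvgpu:\s+(?P<device>\S+)\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<message>.*)"
)


def find_nvgpu_errors(lines):
    """Return one dict per nvgpu [ERR] entry found in an iterable of syslog lines."""
    hits = []
    for raw in lines:
        match = NVGPU_ERR.search(raw)
        if match:
            hits.append(match.groupdict())
    return hits
```

Passing an open file handle for the syslog to `find_nvgpu_errors` gives the timestamp, driver function, and message of every match, which makes it easier to line them up against the application log.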

Researching further, we found similar reports of this lockup error, but they were never resolved:

TX2 nvgpu lockup - Jetson & Embedded Systems / Jetson TX2 - NVIDIA Developer Forums

“The NVIDIA X driver has encountered an error; attempting to recover…” in Xorg.0.log and Xorg instability, is this hardware or software issue? - Jetson & Embedded Systems / Jetson Nano - NVIDIA Developer Forums
(Link removed, as I am a new user and can only post one link.)

What do we need to do to get this problem addressed?

Here is the other reference link:
https://forums.developer.nvidia.com/t/the-nvidia-x-driver-has-encountered-an-error-attempting-to-recover-in-xorg-0-log-and-xorg-instability-is-this-hardware-or-software-issue/122126

Hi,

Is there a failure rate or does the error always occur?

If possible, please share the way to reproduce this issue in our environment.
It will be very helpful to figure out the root cause.

Thanks.

Greetings AastaLLL,

We have experienced this condition about 3 times in the past 4 months. It takes about 3 to 4 weeks of continuous running before we see the error. I pulled the number of restarts from our containers on the two Xavier devices (a restart is triggered when we detect the CUDA error). None of the 6 containers is currently reporting restarts. This is likely because we are actively developing the models that run on those devices, which requires each container to be removed and re-created with the new model. The oldest container is only 2 weeks old at this time.

To your question about reproducing the error: the container was created by SAS, so I am not able to share it with you.

In the absence of being able to reproduce the error in your lab, do you have any thoughts on what could be causing the error?

Thank you for your time and effort!

Hi,

Do you see a GPU lockup?

By lockup, we mean that even after you manually kill all the processes that use the GPU,
the GPU load (with tegrastats) still reports 100% utilization and no GPU jobs can be done in this status.
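The check described above can be sketched in Python as follows. It assumes the GPU load is read from the `GR3D_FREQ` field of each tegrastats line, and that the lines were captured after all GPU processes were killed. The helper names, the 99% threshold, and the 5-sample window are illustrative assumptions, not official criteria.

```python
import re

# tegrastats prints GPU load as a percentage in the GR3D_FREQ field.
GR3D = re.compile(r"GR3D_FREQ\s+(\d+)%")


def gpu_load_percent(tegrastats_line):
    """Return the GR3D_FREQ percentage from one tegrastats line, or None if absent."""
    match = GR3D.search(tegrastats_line)
    return int(match.group(1)) if match else None


def looks_locked_up(lines, threshold=99, min_samples=5):
    """Heuristic: True if the last `min_samples` readings are all >= `threshold`.

    Only meaningful for lines captured AFTER every GPU process has been killed.
    """
    loads = [gpu_load_percent(line) for line in lines]
    loads = [load for load in loads if load is not None]
    if len(loads) < min_samples:
        return False
    return all(load >= threshold for load in loads[-min_samples:])
```

If `looks_locked_up` stays `False` once the GPU processes are gone, the GPU is idle again and the behavior does not match the lockup described here.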

Thanks.

AastaLLL,

The GPU is not locked up. We have 3 Docker containers running that each use the GPU. When we have experienced the error, it has only affected one of them. Stopping and restarting the container that reports the CUDA error recovers it.

Hi,

Which JetPack version do you use?

In our experience, the issue in the topic you shared causes the GPU to hang (the lockup mentioned above).
If the GPU can run normally after the crash, this might be a user-level issue.

For example, memory may not be available for allocation, or there may be an invalid memory access.

Thanks.

Thank you AastaLLL for the continued support!

We are running on one Jetson AGX with JetPack 4.5 (L4T: 32.5.0) and another Jetson AGX with JetPack 4.6 (L4T: 32.6.1). Running our application on both devices resulted in CUDA errors after some time. The behavior observed on the device running JetPack 4.5 was:

  1. Our 3 containers would be running simultaneously without issue, utilizing the Jetson Xavier AGX on JetPack 4.5
  2. We noticed the NVIDIA GPU lockup error mentioned above, in the syslog
  3. We noticed CUDA errors shortly after (2), in our application’s log. We noticed this for one out of the three containers.
  4. The container that received the CUDA error only recovers after it is stopped and started again.
  5. The other containers continue to function and run the application without issue.

Does this align with your definition of the GPU running “normally”? Let me know, thanks.

Hi,

It looks like a user-space error only.

Other containers can run normally when the issue happens.
This indicates that the issue doesn’t come from GPU itself. The GPU can receive and finish jobs as usual.

The failing container can restart the application without resetting the GPU (e.g., by rebooting).
This also indicates that the crash is in user space, not in the GPU firmware.

Thanks.

AastaLLL,

I thought the syslog error meant “no GPU jobs can be done in this status”, which would indicate an issue with the GPU. However, based on the overall behavior it “indicates that the issue doesn’t come from GPU itself”. Are you saying the syslog error should be ignored / is unrelated to the overall behavior?

Additionally, I am including a docker log for a container that recently failed with a CUDA error and restarted. Note that the container was started on 1/9 and the CUDA error was reported on 1/28.

cudaRestartDockerLog.txt (96.0 KB)

Hi,

If a user-space application exits early (e.g., it is force killed), the hardware might report a ctxsw timeout.
In the log you shared recently, I cannot find a GPU kernel error, only some errors from the app side.

2023-01-28T19:56:27,263; ERROR; 122667301; DF.ESP; (dfESPwindow_compute.cpp:149); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[Compute0004] dfESPwindow_compute::computeInsUpbl() for window ForceProjectError: Error processing event fields
2023-01-28T19:56:27,265; ERROR; 122667301; DF.ESP; (dfESPwindow.cpp:1107); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[Window0001] dfESPwindow::compute() for window ForceProjectError: Received compute error <null key detected>,  event: [I,N: 0,], (#1 of 1)
2023-01-28T19:56:27,268; ERROR; 122667301; DF.ESP; (dfESPcontquery.cpp:874); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[ContQuery0012] dfESPcontquery::runMT(): Event processing failed fatally from window <99d61b7f-1024-436c-b2e5-ed6595ccfd8c>::<computerVisionPanelTrackingDg>::<CQ>::<checkNoEvent> to window <99d61b7f-1024-436c-b2e5-ed6595ccfd8c>::<computerVisionPanelTrackingDg>::<CQ>::<ForceProjectError>
2023-01-28T19:56:27,269; ERROR; 122667301; DF.ESP; (dfESPproject.cpp:69); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[Threads0004] contqueryLogic(): Call to runMT() failed

Thanks.

NvidiaSupportThread.zip (8.5 MB)
AastaLLL,

Attached is a zip file (NvidiaSupportThread.zip) that contains two files (docker_log_20221114.txt and syslog.6).

Here are some additional logs to clarify the situation from our perspective. In docker_log_20221114.txt, we start the application at 2022-11-03T19:18:06,271 UTC. The application runs without issue until CUDA messages start at 2022-11-11T18:55:13,006 UTC. After this point, the application is no longer able to process any data (our output files do not grow). In the syslog (syslog.6), we noticed an nvgpu error at Nov 11 13:55:12 EST, which is Nov 11 18:55:12 UTC. So from our perspective, the application runs without issue until the syslog error appears, and about 1 second later it outputs CUDA messages and stops functioning. Restarting the application resolves the issue, but we would like to understand why the syslog error seems to lead to this, since restarting constantly is not a feasible solution.
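To make the time comparison explicit, here is a small Python sketch that converts both timestamps to UTC and computes the gap. It assumes the syslog clock is at a fixed UTC-5 offset (EST) and that the year is 2022, since syslog lines carry no year. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset for EST (assumption: no daylight saving on this date).
EST = timezone(timedelta(hours=-5), "EST")


def syslog_est_to_utc(stamp, year):
    """Convert a syslog stamp such as 'Nov 11 13:55:12' (EST) to an aware UTC datetime."""
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=EST).astimezone(timezone.utc)


def app_log_to_utc(stamp):
    """Parse an application log stamp such as '2022-11-11T18:55:13,006' (already UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S,%f").replace(tzinfo=timezone.utc)


def gap_after_syslog(syslog_stamp, app_stamp, year):
    """Return how long after the syslog entry the application log entry was written."""
    return app_log_to_utc(app_stamp) - syslog_est_to_utc(syslog_stamp, year)
```

With the two timestamps above, `gap_after_syslog("Nov 11 13:55:12", "2022-11-11T18:55:13,006", 2022)` comes out to 1.006 seconds. The syslog stamp has only one-second resolution, so the true gap could be slightly smaller.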

In addition, the "cudaRestartDockerLog.txt" log is from a newer version of our application. The new version force kills and restarts itself when it encounters the issue. The following messages are a result of that and are unrelated to the original scenario:

2023-01-28T19:56:27,263; ERROR; 122667301; DF.ESP; (dfESPwindow_compute.cpp:149); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[Compute0004] dfESPwindow_compute::computeInsUpbl() for window ForceProjectError: Error processing event fields
2023-01-28T19:56:27,265; ERROR; 122667301; DF.ESP; (dfESPwindow.cpp:1107); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[Window0001] dfESPwindow::compute() for window ForceProjectError: Received compute error <null key detected>,  event: [I,N: 0,], (#1 of 1)
2023-01-28T19:56:27,268; ERROR; 122667301; DF.ESP; (dfESPcontquery.cpp:874); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[ContQuery0012] dfESPcontquery::runMT(): Event processing failed fatally from window <99d61b7f-1024-436c-b2e5-ed6595ccfd8c>::<computerVisionPanelTrackingDg>::<CQ>::<checkNoEvent> to window <99d61b7f-1024-436c-b2e5-ed6595ccfd8c>::<computerVisionPanelTrackingDg>::<CQ>::<ForceProjectError>
2023-01-28T19:56:27,269; ERROR; 122667301; DF.ESP; (dfESPproject.cpp:69); {99d61b7f-1024-436c-b2e5-ed6595ccfd8c}[Threads0004] contqueryLogic(): Call to runMT() failed

Hi,

Is it possible to log the tegrastats output at the same time?

2022-11-11T18:55:13,006; WARN ; 46847421; DF.ESP.SA.TKSAAST; (sklstoj.c:119); cuda: 
2022-11-11T18:55:13,006; WARN ; 46847421; DF.ESP.SA; (sklstoj.c:119); Error occurred when processing a data event in score window "<9667b518-b497-4b34-9ff0-cbf3f1b03195>::<computerVisionPanelTracking>::<CQ>::<scoreCVModel>"

The information from the log is quite limited.
It’s even harder to find the root cause without an efficient way to reproduce the issue.

Thanks.

AastaLLL,

We are looking into setting up a mechanism to log the tegrastats output. In the meantime, we understand that the information in the application’s log is limited. However, are you able to gather anything from the syslog now that we have clarified the issue? In addition, here is another example of the issue, which occurred on Feb 18th.
230218_CUDAError.zip (18.2 KB)
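As a rough idea of what the tegrastats logging mechanism could look like, here is a minimal Python sketch. It runs `tegrastats --interval <ms>` and prefixes each output line with a UTC timestamp in the same format as our application log, so the two can be lined up later. The function names, the output path, and the one-second interval are assumptions for illustration.

```python
import subprocess
from datetime import datetime, timezone


def stamp(line, now=None):
    """Prefix a tegrastats line with a UTC timestamp like '2022-11-11T18:55:13,006'."""
    now = now or datetime.now(timezone.utc)
    millis = now.microsecond // 1000
    return f"{now.strftime('%Y-%m-%dT%H:%M:%S')},{millis:03d}; {line.rstrip()}"


def log_tegrastats(out_path, interval_ms=1000):
    """Append timestamped tegrastats output to out_path until interrupted."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms)],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        # Line-buffered so entries reach disk promptly even if the device hangs.
        with open(out_path, "a", buffering=1) as out:
            for line in proc.stdout:
                out.write(stamp(line) + "\n")
    finally:
        proc.terminate()
        proc.wait()
```

Calling `log_tegrastats("/var/log/tegrastats_utc.log")` (a hypothetical path) from a small service on the host would keep a record that survives container restarts. Over several weeks of running, the file would also need rotating.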

Hi,

We want to reproduce this issue internally.
Is it possible to share some setup or steps so we can test it internally?

Thanks.

Hi,

Our internal team is checking the syslogs from 11/11 and 02/18.
We will share some patches with you later to collect more information.

Thanks.