CUDA Error Generated for Computer Vision Projects Running on a Single Jetson Xavier AGX GPU

We are using a Jetson Xavier AGX GPU to run our application in 3 separate Docker containers. The 3 containers share the same single GPU.

Recently, our application started reporting CUDA errors, but the string describing the error never seems to make it into the log.
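For reference, the kind of message we would expect to see is the text returned by cudaGetErrorString(); a minimal, generic sketch of retrieving it (illustrative only, not our actual application code) looks like this:

```cpp
// cuda_error_string.cu - minimal sketch of retrieving the CUDA error text.
// Build with: nvcc cuda_error_string.cu -o cuda_error_string
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}

int main() {
    dummy_kernel<<<1, 1>>>();

    // cudaGetLastError() reports launch-time failures; cudaDeviceSynchronize()
    // surfaces errors that happen while the kernel is executing.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }

    if (err != cudaSuccess) {
        // cudaGetErrorString() is the human-readable text that is missing from our log.
        std::fprintf(stderr, "CUDA error %d: %s\n",
                     static_cast<int>(err), cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```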

Upon investigating this specific occurrence, we noticed an NVIDIA error in the syslog at the exact same time as our application's error:

Line 40561: Nov 11 13:55:12 fs-19362 kernel: [13390120.907710] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507

Line 40562: Nov 11 13:55:12 fs-19362 kernel: [13390120.911724] nvgpu: 17000000.gv11b gv11b_fifo_handle_ctxsw_timeout:1611 [ERR] ctxsw timeout error: active engine id =0, tsg=8, info: awaiting ack ms=3100

Researching further, we found similar reports of this lockup error, but they were never resolved:

TX2 nvgpu lockup - Jetson & Embedded Systems / Jetson TX2 - NVIDIA Developer Forums

“The NVIDIA X driver has encountered an error; attempting to recover…” in Xorg.0.log and Xorg instability, is this hardware or software issue? - Jetson & Embedded Systems / Jetson Nano - NVIDIA Developer Forums
https://forums.developer.nvidia.com/t/the-nvidia-x-driver-has-encountered-an-error-attempting-to-recover-in-xorg-0-log-and-xorg-instability-is-this-hardware-or-software-issue/122126

What do we need to do to get this problem addressed?


Hi,

Does the error occur every time, or is there a certain failure rate?

If possible, please share a way to reproduce this issue in our environment.
That would be very helpful for figuring out the root cause.

Thanks.

Greetings AastaLLL,

We have experienced this condition about 3 times in the past 4 months; it takes about 3 to 4 weeks of continuous running before we see the error. I pulled the restart counts from our containers on the two Xavier devices (a restart is triggered when we detect the CUDA error), and none of the 6 containers is currently reporting restarts. This is likely because our active development of the models running on those devices requires each container to be removed and re-instantiated with the new model; the oldest container is only 2 weeks old at this time.
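To illustrate the kind of detection we mean (a simplified, hypothetical sketch, not our actual code): a failed CUDA call makes the process exit with a non-zero status, so a restart policy such as Docker's on-failure policy can recreate the container.

```cpp
// restart_on_error.cpp - hypothetical sketch of turning a CUDA failure into a
// container restart, assuming the container runs with an on-failure restart policy.
// Build with: nvcc restart_on_error.cpp -o restart_on_error
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort the process as soon as any CUDA call fails, so that
// an external restart policy (e.g. `docker run --restart on-failure`) brings the
// container back up with a fresh CUDA context.
static void exit_on_cuda_error(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error in %s: %s - exiting for restart\n",
                     what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    // Stand-in for one iteration of a vision pipeline: allocate, use, release.
    void* frame = nullptr;
    const size_t frame_bytes = static_cast<size_t>(1920) * 1080 * 3;
    exit_on_cuda_error(cudaMalloc(&frame, frame_bytes), "cudaMalloc");
    exit_on_cuda_error(cudaMemset(frame, 0, frame_bytes), "cudaMemset");
    exit_on_cuda_error(cudaFree(frame), "cudaFree");
    std::puts("iteration completed");
    return 0;
}
```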

To your question about reproducing the error: the container was created by SAS, so I am not able to share it with you.

Given that you cannot reproduce the error in your lab, do you have any thoughts on what could be causing it?

Thank you for your time and effort!

Hi,

Are you hitting a GPU lockup?

By lockup we mean: even after you manually kill all the processes that use the GPU, the GPU load (as reported by tegrastats) still shows 100% utilization, and no GPU jobs can be completed in this state.

Thanks.

AastaLLL,

The GPU is not locked up. We have 3 Docker containers running, each of which uses the GPU. When we have experienced the error, it impacted only one of the containers, and stopping and restarting that container recovers it from the CUDA error.

Hi,

Which JetPack version do you use?

In our experience, the issue in the topic you shared causes the GPU to hang (the lockup mentioned above).
If the GPU can still run normally after the crash, this is more likely a user-level issue.

For example, memory not being available for allocation, or an invalid memory access.
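As a generic, standalone illustration (this sketch is not related to your application), here is roughly what those two user-level failure modes look like at the CUDA API level; an invalid access also leaves the process's CUDA context in an error state, which would match only the affected container needing a restart:

```cpp
// user_space_errors.cu - standalone sketch of the two failure modes mentioned above.
// Build with: nvcc user_space_errors.cu -o user_space_errors
#include <cuda_runtime.h>
#include <cstdio>

__global__ void out_of_bounds(int* p) {
    p[1 << 28] = 42;   // deliberately write far outside the 1-element allocation
}

int main() {
    // 1) Allocation failure: requesting far more memory than the device has
    //    simply returns an error code (typically cudaErrorMemoryAllocation).
    void* huge = nullptr;
    cudaError_t err = cudaMalloc(&huge, 1ULL << 40);   // ~1 TB request
    std::printf("oversized cudaMalloc -> %s\n", cudaGetErrorString(err));

    // 2) Invalid access: usually surfaces asynchronously as an illegal-address
    //    error on the next synchronization. After that, the CUDA context of this
    //    process stays in an error state, so restarting the process (here, the
    //    container) is the simplest way to get a fresh, working context again.
    int* small = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&small), sizeof(int));
    out_of_bounds<<<1, 1>>>(small);
    err = cudaDeviceSynchronize();
    std::printf("after out-of-bounds kernel -> %s\n", cudaGetErrorString(err));
    return 0;
}
```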

Thanks.

Thank you AastaLLL for the continued support!

We are running one Jetson AGX Xavier with JetPack 4.5 (L4T 32.5.0) and another with JetPack 4.6 (L4T 32.6.1); running our application on both devices resulted in CUDA errors after some time. The behavior observed on the device running JetPack 4.5 was:

  1. Our 3 containers were running simultaneously without issue, all utilizing the Jetson Xavier AGX on JetPack 4.5.
  2. We noticed the NVIDIA GPU lockup error mentioned above in the syslog.
  3. Shortly after (2), we noticed CUDA errors in our application's log, but only for one of the three containers.
  4. The container that received the CUDA error only recovers after stopping and starting the container.
  5. The other containers continue to function and run the application without issue.

Does this align with your definition of the GPU running “normally”? Let me know, thanks.

Hi,

It looks like a user-space error only.

The other containers can run normally when the issue happens.
This indicates that the issue doesn't come from the GPU itself; the GPU can still receive and finish jobs as usual.

The affected container can restart the application without resetting the GPU (e.g. rebooting the device).
This also indicates that the crash is in user space, not in the GPU firmware.

Thanks.