Questions about the CUDA Runtime

Hello.

I’m encountering the following TensorRT and CUDA errors during long-term aging tests in the environment described below. Do you know the cause, or have any related information?

[04/15/2026-06:55:53] [E] [TRT] 1: [reformat.cpp::executeCutensor::387] Error Code 1: CuTensor (Internal cuTensor permutate execute failed)

[04/15/2026-06:55:53] [E] [TRT] 1: [checkMacros.cpp::catchCudaError::202] Error Code 1: Cuda Runtime (an async error has occurred in an external entity outside of CUDA)

Operating Environment

  • Jetson Orin NX 8G

  • JetPack 5.1.2

  • custom board

  • Inference using a combination of 1 GPU model + 3 DLA models

Hi,

Could you share more information about the issue?

  1. How long does it take to reproduce the issue?
  2. Is the system hung, rebooted, or just a user space application crash?
  3. Is there any error in the dmesg?
$ sudo dmesg

Thanks.

Answer.

1. The recurrence interval is random. It can occur as soon as 1–2 days, or take up to about a week.

2. When the symptom occurs, only the inference process fails. There are no system hangs or application crashes.

3. Currently, dmesg logs are not being collected, so I cannot attach them. However, when the symptom occurred, there were no notable failure logs in the dmesg.
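
For the next aging run, kernel messages could be captured continuously so they are available when the failure occurs. The following is only a minimal sketch: it assumes `dmesg --follow --ctime` is available and run as root, and the keyword list and the log file name `dmesg_aging.log` are hypothetical choices, not names confirmed anywhere in this thread.

```python
import subprocess

# Hypothetical keywords worth flagging; adjust to the drivers in use.
KEYWORDS = ("nvgpu", "nvdla", "smmu", "oom", "out of memory")


def is_interesting(line, keywords=KEYWORDS):
    """Return True if a kernel log line mentions any of the keywords."""
    lowered = line.lower()
    return any(k in lowered for k in keywords)


def collect_dmesg(log_path="dmesg_aging.log"):
    """Follow the kernel log, append every line to log_path,
    and echo lines that match a keyword. Requires root."""
    cmd = ["dmesg", "--follow", "--ctime"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(log_path, "a") as log:
        for line in proc.stdout:
            log.write(line)
            log.flush()  # keep the file current in case of a sudden reset
            if is_interesting(line):
                print("possible fault:", line.rstrip())
```

Calling `collect_dmesg()` in a separate process alongside the inference application would leave a timestamped kernel log that can be matched against the TensorRT error times.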

Hi,

Based on your comment, it sounds more like an application-level issue.
Is there another application running concurrently?
Any possibility that the device is running out of memory?

Thanks.

Hi.

There are other applications running at the same time, but they aren’t performing inference using TensorRT.

They are performing decoding using NVDEC.

According to HTOP, 6GB out of 7.3GB of memory is in use, and Zram is also using several hundred megabytes.

Could this be caused by insufficient memory?
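
To check whether the failures line up with memory pressure, available memory and swap usage could be logged over time rather than read from HTOP by hand. This is a minimal sketch that parses `/proc/meminfo`; it assumes Zram is configured as swap (so it shows up in `SwapTotal`/`SwapFree`), and the function names and the 10-second interval are illustrative choices.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_snapshot(text):
    """Return available memory and used swap, both in MiB."""
    info = parse_meminfo(text)
    swap_used_kb = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "available_mib": info.get("MemAvailable", 0) // 1024,
        "swap_used_mib": swap_used_kb // 1024,
    }


def log_memory(path="/proc/meminfo", interval_s=10.0, samples=None):
    """Print a timestamped snapshot every interval_s seconds.
    samples=None means loop until interrupted."""
    count = 0
    while samples is None or count < samples:
        with open(path) as f:
            snap = memory_snapshot(f.read())
        print(time.strftime("%Y-%m-%d %H:%M:%S"), snap, flush=True)
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Comparing the logged timestamps with the TensorRT error timestamps would show whether available memory was unusually low just before each failure.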

Hi,

Based on the message below, the error seems related to the interaction with other applications/libraries:

Error Code 1: Cuda Runtime (an async error has occurred in an external entity outside of CUDA)

We need more logs to figure out the issue.
Could you try to reproduce the problem with the CUDA coredump:
Set the CUDA_ENABLE_COREDUMP_ON_EXCEPTION environment variable to 1 and share the file with us.

Thanks.

Hi.

I will set `export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1`, attempt to reproduce the issue, and then share the results.

Thanks.

Hello.

I set the environment variable to generate a core dump and tried to reproduce the issue, but even when the problem occurred, no core dump was generated.

In my opinion, the issue occurs more often when heavy, repeated requests to the GPU/DLA coincide with low memory.

Thanks.

Hi,

We will need to reproduce this issue locally to debug it further.

Do you have dependencies on JetPack 5?
If not, could you try to reproduce the issue on JetPack 6 to see if the hang still occurs?
If yes, could you share a reproducible source and corresponding steps with us?

Thanks.

Hi.

Our custom board has a dependency on JetPack 5.1.2.

As a result, migrating to JetPack 6 is not straightforward.

The reproduction process involves decoding 8 streams at 1080p@10fps, extracting results using the GPU detection model, and then passing those results to 3 DLA models as needed to extract further results.

If this process is continuously run, the issue occurs intermittently.

PS. Please understand that we are unable to share the model due to company policy.

Thanks

Hi,

Could you share more about how you share the buffer between DLA and GPU?
Is it possible that the three DLA tasks read the same buffer concurrently?

More precisely, does your use case involve concurrent read/write, or only concurrent reads?

Thanks.

Hi.

The DLA and GPU read from different buffers and write to different buffers.

A single flow is as follows:

In Buf -> GPU -> Out Buf  -> GPU Result -> New In Buf -> DLA 1 -> New Out Buf ->  DLA Result
                                        -> New In Buf -> DLA 2 -> New Out Buf -> DLA Result
                                        -> New In Buf -> DLA 3 -> New Out Buf -> DLA Result

DLA models are called selectively based on the GPU Result.

The above flow is running indefinitely at 7 fps on 8 threads.

Thanks.

Hi,

Is the selection done by the CPU?
If yes, is there concurrent access between CPU and GPU?

If so, please try to add a synchronization call to avoid concurrent access to see if the issue remains.

Thanks.

Hi.

Before reaching out with this issue, I had already checked the synchronization between the CPU and GPU, and I’ve added synchronization measures to prevent concurrent access where necessary.

In short, the problem still occurs even after adding synchronization.

Thanks.

Hi.
This is Yongjun Kwon, the software project leader.

During aging tests under high-load conditions (complex scenes with many objects), we are encountering the following error:
[E] [TRT] 1: [cudlaUtils.cpp::submit::95] Error Code 1: DLA (Failed to submit program to DLA engine.)
This appears to be caused by DLA resource contention and scheduling bottlenecks when processing multiple models on a single DLA core.

Configuration: 1 GPU model + 3 DLA models (8-channel inference, 7fps per channel)

To prevent DLA queue overflow, we are considering a Dynamic FPS Scaling approach:
  • Monitor scene complexity (e.g., number of detected objects or DLA processing latency).
  • If complexity is high, temporarily reduce the input frame rate (e.g., 7 fps → 4–5 fps).
  • Restore to 7 fps once the scene complexity decreases.

Is this dynamic scaling a recommended approach to mitigate DLA submit errors on Orin NX 8GB?
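
To make the idea concrete, here is a minimal sketch of such a controller, using smoothed DLA latency as the complexity signal and two thresholds (hysteresis) so the frame rate does not flip back and forth. The class name `FpsController`, the 120 ms / 80 ms thresholds, and the smoothing factor are all hypothetical values for illustration, not measured on our system.

```python
class FpsController:
    """Pick a target frame rate from DLA latency, with hysteresis."""

    def __init__(self, normal_fps=7.0, reduced_fps=4.0,
                 high_latency_ms=120.0, low_latency_ms=80.0, alpha=0.2):
        self.normal_fps = normal_fps
        self.reduced_fps = reduced_fps
        self.high = high_latency_ms   # switch to reduced_fps above this
        self.low = low_latency_ms     # switch back to normal_fps below this
        self.alpha = alpha            # exponential smoothing factor
        self.smoothed_ms = None
        self.fps = normal_fps

    def update(self, latency_ms):
        """Feed one latency sample and return the frame rate to use."""
        if self.smoothed_ms is None:
            self.smoothed_ms = latency_ms
        else:
            self.smoothed_ms += self.alpha * (latency_ms - self.smoothed_ms)

        if self.fps == self.normal_fps and self.smoothed_ms > self.high:
            self.fps = self.reduced_fps
        elif self.fps == self.reduced_fps and self.smoothed_ms < self.low:
            self.fps = self.normal_fps
        return self.fps
```

Each channel would call `update()` after every DLA inference and drop or delay frames to match the returned rate. The gap between the two thresholds is what keeps the rate stable while latency hovers near the limit.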

Hi,

Do you face the same issue as @jp.ko?
Since Orin NX 8GB only has one DLA, could you try launching the DLA models sequentially?

Thanks.

Hi

Right. Same issue.

I’ve added one more error log to the previous ones. (This log also occurred together with the previous logs.)

[previous log]

[04/15/2026-06:55:53] [E] [TRT] 1: [reformat.cpp::executeCutensor::387] Error Code 1: CuTensor (Internal cuTensor permutate execute failed)

[04/15/2026-06:55:53] [E] [TRT] 1: [checkMacros.cpp::catchCudaError::202] Error Code 1: Cuda Runtime (an async error has occurred in an external entity outside of CUDA)

[one more log]

[05/02/2026-06:43:48] [E] [TRT] 1: [cudlaUtils.cpp::submit::95] Error Code 1: DLA (Failed to submit program to DLA engine.)

So far, we’ve had it running in parallel, but this time we’d like to change it to run sequentially and test it.

Thanks

Hi.

@yj45.kwon is the software project leader on my team.

I’ve been concerned that simultaneous requests to the DLA might be causing issues, so we’re currently testing a change to limit requests to one at a time.

I’ll share the results once we have them.
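
Roughly, the change looks like the sketch below: all worker threads go through one lock, so only one DLA request is in flight at a time. This is a simplified illustration, not our actual code; `SerializedDla` and `run_fn` are hypothetical names, with `run_fn` standing in for whatever call runs a DLA engine.

```python
import threading


class SerializedDla:
    """Allow only one DLA inference at a time across all threads."""

    def __init__(self, run_fn):
        # run_fn(model_id, inputs) is a placeholder for the real
        # inference call; it is assumed to block until results are ready.
        self._run_fn = run_fn
        self._lock = threading.Lock()

    def infer(self, model_id, inputs):
        # The lock is held for the whole call, so a second thread
        # cannot submit to the DLA until the first one has finished.
        with self._lock:
            return self._run_fn(model_id, inputs)
```

This only serializes correctly if the wrapped call is synchronous. With an asynchronous enqueue, the stream would also have to be synchronized before the lock is released.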

Thanks.

Hi both,

Which inference call do you use?

Do you use enqueueV3?
If an asynchronous call is used, could you try switching to executeV2?

Thanks.

Hi.

I’m already using executeV2.
I’ve also tried enqueueV3, but I’m still encountering the same issue.

Thanks.