# DeepStream 7.1 — `gst-nvinfer` `cudaErrorIllegalAddress` under sustained inference load on Jetson Orin Nano (JetPack 6.2 / L4T 36.4.x)
**Status**: Open, seeking guidance
**Product context**: commercial AI vision totem currently in pilot deployment, built on Jetson Orin Nano + DeepStream
**Date**: 2026-05-05
---
## 1. Summary
Under sustained per-frame inference load, our DeepStream pipeline crashes with
`cudaErrorIllegalAddress` (CUDA error 700) inside `gst-nvinfer`'s
`NvDsInferContextImpl::releaseBatchOutput`, followed by an unrecoverable CUDA
context corruption that cascades into `SIGSEGV` and process termination.
Our process supervisor restarts the pipeline automatically, but the bug
re-fires within 30-90 s of resumption.
We have substantially reduced the crash rate by throttling `nvinfer.interval`, but
the underlying race / double-release appears to persist. We would like
guidance on whether this matches a known issue in DeepStream 7.1 and whether
a patched build or supported workaround exists.
## 2. Environment
| Item | Value |
|---|---|
| Hardware | NVIDIA Jetson Orin Nano Super (8 GB) Developer Kit |
| Power profile | `MAXN SUPER` (`nvpmodel -m 0`), official 19V/4.74A barrel-jack PSU |
| Storage | NVMe SSD, ext4, 18% utilised, no I/O errors |
| L4T | 36.4.x — currently a mixed state: most packages at 36.4.4-20250616085344, three packages (`nvidia-l4t-gstreamer`, `nvidia-l4t-jetson-multimedia-api`, `nvidia-l4t-libwayland-egl1`) at 36.4.7-20250918154033. `libnvbufsurftransform` (in `nvidia-l4t-multimedia`) is at 36.4.4 alongside the CUDA runtime. |
| CUDA | 12.6.68 |
| TensorRT | bundled with JetPack 6.2 (10.x) |
| DeepStream | 7.1.0-1 |
| Custom YOLO parser | `libnvdsinfer_custom_impl_Yolo.so` (DeepStream-Yolo, NVIDIA-AI-IOT compatible) |
| Model | YOLOv8 PPE-detector, 8 classes, FP16, 640×640 input, exported to TensorRT engine |
| Camera | Logitech C920 HD Pro (UVC, 1280×720 @ 30 fps MJPEG over USB 2) |
| Tracker | Tested both `config_tracker_IOU.yml` (currently in use) and `config_tracker_NvDCF_perf.yml` (worse — see §6) |
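The mixed L4T package state noted in the table can be summarised mechanically. The sketch below is only an illustration: it groups `nvidia-l4t-*` packages by version from text shaped like the output of `dpkg-query -W -f='${Package} ${Version}\n' 'nvidia-l4t-*'`. The function name `group_l4t_versions` and the sample text are hypothetical; the sample lists only the packages named above, not the full installed set.

```python
from collections import defaultdict


def group_l4t_versions(dpkg_output: str) -> dict[str, list[str]]:
    """Group nvidia-l4t-* package names by their installed version string."""
    by_version: dict[str, list[str]] = defaultdict(list)
    for line in dpkg_output.splitlines():
        parts = line.split()
        # Expect exactly "<package> <version>"; skip anything else.
        if len(parts) != 2 or not parts[0].startswith("nvidia-l4t-"):
            continue
        package, version = parts
        by_version[version].append(package)
    return {version: sorted(pkgs) for version, pkgs in by_version.items()}


# Illustrative sample only (hypothetical capture, limited to packages
# named in the environment table).
SAMPLE = """\
nvidia-l4t-multimedia 36.4.4-20250616085344
nvidia-l4t-gstreamer 36.4.7-20250918154033
nvidia-l4t-jetson-multimedia-api 36.4.7-20250918154033
nvidia-l4t-libwayland-egl1 36.4.7-20250918154033
"""

if __name__ == "__main__":
    for version, packages in sorted(group_l4t_versions(SAMPLE).items()):
        print(f"{version}: {len(packages)} package(s)")
        for name in packages:
            print(f"  {name}")
```

More than one key in the result indicates a mixed install; on the device, the real `dpkg-query` output would be passed in place of `SAMPLE`.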
## 3. Pipeline topology
```
v4l2src (1280x720, MJPEG)
  -> capsfilter (image/jpeg, 1280x720, 30/1)
  -> jpegdec
  -> videoconvert
  -> tee
      ├── queue (leaky=2, max-size-buffers=1)
      │     -> nvvideoconvert
      │     -> capsfilter (video/x-raw(memory:NVMM), format=NV12, width=640, height=360)
      │     -> nvstreammux (batch-size=1, live-source=1, width=640, height=360,
      │                     batched-push-timeout=1000)
      │     -> nvinfer (interval=2, FP16, custom YOLO parser, cluster-mode=2,
      │                 maintain-aspect-ratio=1, symmetric-padding=1)
      │     -> nvtracker (IOU)
      │     -> fakesink
      └── queue (leaky=2, max-size-buffers=1)
            -> capsfilter (video/x-raw, I420)
            -> jpegenc (preview branch, system memory only)
            -> appsink (max-buffers=1, drop=true, sync=false)
```
The MJPEG preview branch is deliberately kept in system memory (no
`nvvideoconvert`) so that NVMM access is single-consumer. The capsfilter
between `nvvideoconvert` and `nvstreammux` pins width/height to the
streammux profile so streammux does not have to invoke its internal
`nvbufsurftransform` resize on every buffer.
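For anyone trying to reproduce the topology, it can be approximated as a `gst-launch`-style description string. This is a sketch under assumptions, not our production code: the device node, the `nvinfer` config path, and the tracker library and config paths are placeholders, and the queue depth is expressed with the `queue` element's `max-size-buffers` property.

```python
def build_pipeline_description(
    device: str = "/dev/video0",                    # assumed V4L2 node
    infer_config: str = "config_infer_primary.txt", # hypothetical path
    tracker_lib: str = (
        "/opt/nvidia/deepstream/deepstream/lib/"
        "libnvds_nvmultiobjecttracker.so"           # assumed install path
    ),
    tracker_config: str = "config_tracker_IOU.yml", # assumed relative path
    interval: int = 2,
) -> str:
    """Return a launch description mirroring the two-branch topology."""
    source = " ! ".join([
        f"v4l2src device={device}",
        "image/jpeg,width=1280,height=720,framerate=30/1",
        "jpegdec",
        "videoconvert",
        "tee name=t",
    ])
    # Inference branch: NVMM caps pinned to the streammux resolution.
    to_mux = " ! ".join([
        "t.",
        "queue leaky=2 max-size-buffers=1",
        "nvvideoconvert",
        "video/x-raw(memory:NVMM),format=NV12,width=640,height=360",
        "mux.sink_0",
    ])
    inference = " ! ".join([
        "nvstreammux name=mux batch-size=1 live-source=1 "
        "width=640 height=360 batched-push-timeout=1000",
        f"nvinfer config-file-path={infer_config} interval={interval}",
        f"nvtracker ll-lib-file={tracker_lib} ll-config-file={tracker_config}",
        "fakesink",
    ])
    # Preview branch: stays in system memory, never touches NVMM.
    preview = " ! ".join([
        "t.",
        "queue leaky=2 max-size-buffers=1",
        "video/x-raw,format=I420",
        "jpegenc",
        "appsink name=preview max-buffers=1 drop=true sync=false",
    ])
    return " ".join([source, to_mux, inference, preview])


if __name__ == "__main__":
    print(build_pipeline_description())
```

The returned string is intended for `Gst.parse_launch` or, with shell quoting around the NVMM caps, for `gst-launch-1.0`. Options such as FP16 and `cluster-mode=2` live in the `nvinfer` config file and are therefore not visible here.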
## 4. Crash signature (full sequence)
A representative crash from `aegis-error.log`, reproduced verbatim. The same
sequence has been captured >50 times across dozens of pipeline restarts.
```
ERROR: [TRT]: [cudaDriverHelpers.cpp::operator()::106] Error Code 1: Cuda Driver
(an illegal memory access was encountered)
ERROR: cudaStreamDestroy failed, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: cudaStreamDestroy failed, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: cudaEventDestroy failed, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: cudaEventDestroy failed, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: cudaFree failed, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: cudaFreeHost failed, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: [TRT]: createInferRuntime: Error Code 6: API Usage Error
(CUDA initialization failure with error: 700)
ERROR: [TRT]: [checkMacros.cpp::catchCudaError::212] Error Code 1: Cuda Runtime
(an illegal memory access was encountered)
[process exits via SIGSEGV]
```
In a small number of cases we have also captured the warmup-period variant:
```
WARN: nvinfer gstnvinfer.cpp:2461 gst_nvinfer_output_loop:
error: Failed to dequeue output from inferencing.
NvDsInferContext error: NVDSINFER_CUDA_ERROR
WARN: nvinfer gstnvinfer.cpp:681 NvDsInferContext[UID 1]:
Warning from NvDsInferContextImpl::releaseBatchOutput()
<nvdsinfer_context_impl.cpp:1990> [UID = 1]:
Tried to release an outputBatchID which is already with the context
ERROR: nvinfer gstnvinfer.cpp:1267 get_converted_buffer:
cudaMemset2DAsync failed with error cudaErrorIllegalAddress
while converting buffer
WARN: nvinfer gstnvinfer.cpp:1576 gst_nvinfer_process_full_frame:
error: Buffer conversion failed
/dvs/git/dirty/git-master_linux/nvutils/nvbufsurftransform/nvbufsurftransform_copy.cpp:341:
=> Failed in mem copy
ERROR: [TRT]: IExecutionContext::enqueueV3: Error Code 1:
Cask (Cask convolution execution)
```
The `Tried to release an outputBatchID which is already with the context`
line is what we believe to be the proximate cause — a double-release inside
`NvDsInferContextImpl::releaseBatchOutput`. Once that fires, every subsequent
CUDA call in the process returns `cudaErrorIllegalAddress` until the process
exits.
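To tell the two variants apart when triaging captured logs, a small classifier along these lines may help. It is a minimal sketch: the match strings are taken from the excerpts above, while the function name and the category labels (`full-cascade`, `warmup-variant`, and so on) are hypothetical names introduced here for illustration.

```python
import re

# Strings copied from the log excerpts in this section.
DOUBLE_RELEASE = re.compile(
    r"Tried to release an outputBatchID which is already with the context"
)
TEARDOWN_700 = re.compile(
    r"cuda(StreamDestroy|EventDestroy|Free|FreeHost) failed, cuda err_no:700"
)
INIT_700 = re.compile(r"CUDA initialization failure with error: 700")
ANY_ILLEGAL_ADDRESS = re.compile(r"cudaErrorIllegalAddress|err_no:700")


def classify_crash(log_text: str) -> str:
    """Return a coarse label for one crash window of log text."""
    if TEARDOWN_700.search(log_text) or INIT_700.search(log_text):
        # Teardown or re-init already failing: the CUDA context is gone.
        return "full-cascade"
    if DOUBLE_RELEASE.search(log_text):
        return "warmup-variant"
    if ANY_ILLEGAL_ADDRESS.search(log_text):
        return "illegal-address-other"
    return "unclassified"


if __name__ == "__main__":
    sample = (
        "Tried to release an outputBatchID which is already with the context\n"
        "cudaMemset2DAsync failed with error cudaErrorIllegalAddress\n"
    )
    print(classify_crash(sample))
```

The teardown check runs first because a window containing both signatures has already reached the unrecoverable state.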
## 5. Reproduction
1. Boot the Jetson into MAXN SUPER, connect the C920, start the pipeline above.
2. Stand a person in front of the camera so the YOLO model produces sustained
detections (≥ 1 object per frame).
3. Within 30-90 s, the crash sequence in §4 fires and the process exits.
We have reproduced this on:
* PIXY USB camera at 30 fps (where it fired less often, ~1 per 13 hours
overnight, because the camera frequently dropped frames and accidentally
protected the buffer pool).
* Logitech C920 at 30 fps (where reliable frame delivery exposes the bug
within ~50-90 s under active person load).
* Both `interval=0` (every frame; crashes within ~14 s) and `interval=1`
(every other frame; crashes within ~50-90 s under load).
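For reference, the relationship between `interval` and effective inference rate used in this report can be written out as below. The sketch assumes `interval` counts the batches skipped between inferences, which matches the "every frame" and "every other frame" descriptions above; the helper name is hypothetical.

```python
def inference_rate_hz(source_fps: float, interval: int) -> float:
    """Effective inference rate when `interval` batches are skipped each time."""
    if interval < 0:
        raise ValueError("interval must be >= 0")
    return source_fps / (interval + 1)


if __name__ == "__main__":
    for interval in (0, 1, 2):
        rate = inference_rate_hz(30.0, interval)
        print(f"interval={interval} -> {rate:.1f} Hz")
    # interval=0 -> 30.0 Hz
    # interval=1 -> 15.0 Hz
    # interval=2 -> 10.0 Hz
```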
## 6. Mitigations attempted
| Change | Effect |
|---|---|
| Pin `width`/`height` in the NVMM caps before streammux so streammux does not invoke its internal resize | Significantly reduced the crash rate but did not eliminate it. Crash site moved from streammux's resize to `gst-nvinfer`'s internal converter. |
| `nvinfer interval=2` (10 Hz inference) instead of `interval=0` | Extended time to crash from “within 14 s” to “within 50-90 s under load” (roughly 4-6× longer). Required to keep the system usable. |
| Replace IOU tracker with `NvDCF_perf` | **Made it worse.** NvDCF crashed within ~50 s with `gstnvtracker: Low-level tracker lib returned error 1` and on restart `gstnvtracker: Failed to create cuda stream for buffer conversion: cudaErrorIllegalAddress`. The CUDA context was unrecoverable until full process restart. |
| Process-supervisor restart on SIGSEGV (pm2 fork mode, `min_uptime=10s`, `restart_delay=2000`) | Restores service in ~5 s but is not an acceptable production posture for our use case (totem drops video for 5-10 s every restart). |
| Multi-layer in-process watchdog (15 s no-frames → in-process pipeline restart, 2 attempts → escalate to process restart) | Recovers from the milder warmup-period variant. Does not recover from the full `cudaErrorIllegalAddress` cascade because the CUDA context cannot be re-initialised in-process. |
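The escalation policy in the last row can be sketched as a small state machine. This is not our production watchdog: the class and method names are hypothetical, and it assumes "2 attempts" means two in-process pipeline restarts without an intervening frame before escalating to a process restart. Time is passed in explicitly so the logic can be exercised without a running pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART_PIPELINE = "restart_pipeline"
    RESTART_PROCESS = "restart_process"


@dataclass
class StallWatchdog:
    stall_timeout_s: float = 15.0      # no-frames threshold from the table
    max_pipeline_restarts: int = 2     # in-process attempts before escalating
    last_activity_ts: float = 0.0
    restarts_since_frame: int = 0

    def on_frame(self, now: float) -> None:
        """Call whenever a frame is delivered; clears the escalation state."""
        self.last_activity_ts = now
        self.restarts_since_frame = 0

    def poll(self, now: float) -> Action:
        """Call periodically; returns the action the caller should take."""
        if now - self.last_activity_ts < self.stall_timeout_s:
            return Action.NONE
        if self.restarts_since_frame >= self.max_pipeline_restarts:
            return Action.RESTART_PROCESS
        self.restarts_since_frame += 1
        # Give the restarted pipeline a full timeout window to produce frames.
        self.last_activity_ts = now
        return Action.RESTART_PIPELINE


if __name__ == "__main__":
    wd = StallWatchdog()
    wd.on_frame(0.0)
    for t in (10.0, 15.0, 30.0, 45.0):
        print(t, wd.poll(t).value)
```

In the full `cudaErrorIllegalAddress` cascade the in-process restarts cannot succeed, so a policy like this only bounds how long the process lingers before the supervisor takes over.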
## 7. What we are asking
In priority order:
1. **Is this a known issue in DeepStream 7.1’s `gst-nvinfer`?** Specifically, we
suspect a double-release race in `NvDsInferContextImpl::releaseBatchOutput` under
sustained 1-source, 1-batch inference on Jetson Tegra. Our error
signatures are reproducible and identical across runs.
2. **Is there a patched DeepStream build available** (DS 7.1.x maintenance
release, DS 7.2 preview, internal patch) that fixes the race?
3. **Is there a supported pipeline configuration that avoids the bug?** For
example: `nvbuf-memory-type` setting, alternative buffer-pool size, an
`nvstreammux` config we have not tried, or a recommendation against
`tee` with `nvinfer` on the same source on Tegra.
4. **Should we move to a newer JetPack 6.x release** (e.g. r36.4.7 across the board
instead of our current mixed 36.4.4 / 36.4.7 state) before further
investigation? We have left the runtime libraries (`libnvbufsurftransform`,
`libcuda`, CUDA 12.6 runtime) at 36.4.4 to maintain ABI consistency.
5. **As a last resort**: is there a recommended path to bypass `gst-nvinfer`
entirely and call TensorRT directly from a custom appsrc/appsink loop on
Jetson, with an example we can study?
## 8. Artefacts available on request
* Full pm2 / aegis-error / aegis-out logs from a known crash window (~90 MB compressed).
* `nvinfer` config file we generate at runtime.
* GStreamer pipeline `.dot` graph captured at `PLAYING` state.
* `tegrastats` capture across a crash window.
* `dmesg` from boot through a representative crash.
* SQLite dump of our `pipeline_stall_events` and `process_restarts` audit
tables (~280 stall snapshots and ~120 restart records to date).