AppDecPerf gets stuck in cuCtxCreate when more than 25 threads are set


I am using an RTX A6000 on an Orin IGX platform to develop multi-channel decoding with the Video Codec SDK. I ran AppDecPerf to evaluate performance, but it gets stuck in cuCtxCreate when more than 25 threads are set with the command `./AppDecPerf -i Test.avc -thread 25 -host`. The number of successful cuCtxCreate calls before it gets stuck varies randomly between 19 and 22 on every run of AppDecPerf. Is there any limitation on HW decoding with the RTX A6000? Thanks a lot for any help.

Here is the system information:

Platform: Orin IGX
Graphic card: RTX A6000
Kernel: 5.10.104-tegra (Ubuntu)
Command: ./AppDecPerf -i Test.avc -thread 25 -host
CUDA version: 11.8
Driver: 520.61.05


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000005:06:00.0  On |                    0 |
| 30%   41C    P2    76W / 300W |   5903MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1799      G   /usr/lib/xorg/Xorg                105MiB |
|    0   N/A  N/A      2299      G   /usr/bin/gnome-shell              105MiB |
|    0   N/A  N/A      2949      C   ./AppDecPerf                     5686MiB |
+-----------------------------------------------------------------------------+

Hi there,

every context created consumes a certain amount of GPU and host memory, depending on the CUDA task, and video decoding has quite high memory requirements. So naturally there is a limit to how many contexts can safely be created. Since this is only a sample app, only limited safeguards are built in to avoid possible memory issues.
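As a rough illustration, you can measure the per-context footprint on your own system with the CUDA driver API. This is only a sketch with error handling trimmed, and note that cuMemGetInfo reports device memory only, not the host-side cost of each context:

```c
#include <stdio.h>
#include <cuda.h>

/* Sketch: create contexts in a loop and report roughly how much device
 * memory each one costs. Compile with: gcc probe.c -lcuda
 * Host-side per-context allocations are not visible to cuMemGetInfo. */
int main(void)
{
    CUdevice dev;
    CUcontext ctx[32];
    size_t freeMem, totalMem, prevFree;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    cuCtxCreate(&ctx[0], 0, dev);            /* first context */
    cuMemGetInfo(&freeMem, &totalMem);
    prevFree = freeMem;

    for (int i = 1; i < 32; i++) {
        if (cuCtxCreate(&ctx[i], 0, dev) != CUDA_SUCCESS) {
            printf("context creation failed at #%d\n", i);
            break;
        }
        cuMemGetInfo(&freeMem, &totalMem);   /* queries the new current context's device */
        printf("context #%d cost ~%zu MiB device memory\n",
               i, (prevFree - freeMem) >> 20);
        prevFree = freeMem;
    }
    return 0;
}
```

A decoder session on top of each context adds significantly more than the bare context itself, so treat these numbers as a lower bound.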

The purpose of the app is to measure actual performance, as you intended, not to probe memory limits. Using higher-resolution test videos or more demanding codecs with a lower thread count might give you a better understanding of the performance you can expect.

A couple of notes that might help:

  • The NVDEC Application Note (NVIDIA Docs) shows a rough comparison of maximum decode frame rates for certain codecs on certain GPUs. That will give you an indication of how many parallel streams you might be able to decode.
  • You could try removing the `-host` command-line argument, since the IGX might limit the possible threads in this case due to memory constraints.
  • To test multi-channel decoding, the sample app AppDecMultiFiles might be better suited.
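Regarding the first bullet: besides the application note, you can query the decode capabilities of your GPU programmatically. A hedged sketch, assuming the Video Codec SDK headers are on your include path (the exact CUVIDDECODECAPS fields are defined in the nvcuvid.h shipped with your SDK version):

```c
#include <stdio.h>
#include <cuda.h>
#include "nvcuvid.h"   /* from the Video Codec SDK Interface directory */

/* Sketch: query H.264 4:2:0 8-bit decode capabilities of device 0.
 * Link with -lcuda -lnvcuvid. */
int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUVIDDECODECAPS caps = { 0 };

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   /* cuvidGetDecoderCaps needs a current context */

    caps.eCodecType      = cudaVideoCodec_H264;
    caps.eChromaFormat   = cudaVideoChromaFormat_420;
    caps.nBitDepthMinus8 = 0;

    if (cuvidGetDecoderCaps(&caps) == CUDA_SUCCESS && caps.bIsSupported) {
        printf("H.264 4:2:0 8-bit supported, max %ux%u\n",
               caps.nMaxWidth, caps.nMaxHeight);
    } else {
        printf("H.264 decode not supported on this device\n");
    }

    cuCtxDestroy(ctx);
    return 0;
}
```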

I hope this allows you to progress.


Hi Markus,

I am sorry I didn't describe the details of our use case. We use the A6000 on an Orin IGX to decode 80 channels of 1920x1080@10fps H.264 in real time simultaneously. The source of each channel is an RTSP stream.

To decode all 80 channels simultaneously, our application creates 80 processes at the same time. Each process runs one decoder for a 1920x1080@10fps H.264 stream.

I used a script to create 80 processes of the sample app AppDecLowLatency to simulate our use case. This is the script:
for (( i=1; i<=80; i++ ))
do
    sleep 0.5
    ./AppDecLowLatency -i Test_nv12.avc -o test$i.yuv &
done

I also added some messages to trace the stuck problem:
printf("Create ctx begin %d\n", getpid()); // test
createCudaContext(&cuContext, iGpu, 0);
printf("\033[0;32;31m" "Create ctx end ctx %d, %d\n" "\033[m", cuContext, getpid()); // test

Some of these processes also get stuck in cuCtxCreate. However, the stuck processes are not the most recently launched ones.

For example, if the stuck process ID is 3829, there is no "Create ctx end" message for process 3829.

Part of the messages:

Create ctx begin 3824
GPU in use: NVIDIA RTX A6000
Create ctx end ctx -344900912, 3824
Create ctx begin 3829
GPU in use: NVIDIA RTX A6000
Create ctx begin 3833
GPU in use: NVIDIA RTX A6000
Create ctx end ctx -182117680, 3833

Stuck process from nvidia-smi:

| 0 N/A N/A 3829 C ./AppDecLowLatency 10MiB |

Because other processes can still be created after the process that gets stuck, it doesn't look like an out-of-memory issue. Could you please help with this issue? Any help would be much appreciated.