All cudaStreamSynchronize() calls hang in TensorRT threads on Jetson Orin

Hi all,
All my TensorRT threads hang at cudaStreamSynchronize().
This is part of my backtrace in gdb:
Thread 35 (Thread 0xfffe590b5500 (LWP 3604513)):
#0 0x0000ffff88069938 in ioctl () from target:/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000fffeef8d5e98 in ?? () from target:/lib/libnvrm_host1x.so
#2 0x0000fffeefdec0ac in ?? () from target:/lib/libcuda.so.1
#3 0x0000fffeefca2de8 in ?? () from target:/lib/libcuda.so.1
#4 0x0000fffeefcbae34 in ?? () from target:/lib/libcuda.so.1
#5 0x0000fffeefd16f54 in ?? () from target:/lib/libcuda.so.1
#6 0x0000fffeefd220d8 in ?? () from target:/lib/libcuda.so.1
#7 0x0000fffeefd35298 in ?? () from target:/lib/libcuda.so.1
#8 0x0000ffff841b1568 in ?? () from target:/apollo/bazel-bin/modules/rs_perception/trafficlight/tfl_nn/component/…/…/…/…/…/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.11.0
#9 0x0000ffff8420c974 in cudaStreamSynchronize () from target:/apollo/bazel-bin/modules/rs_perception/trafficlight/tfl_nn/component/…/…/…/…/…/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.11.0


Thread 33 (Thread 0xfffe5b1db500 (LWP 3604511)):
#0 0x0000ffff88069938 in ioctl () from target:/lib/aarch64-linux-gnu/libc.so.6
#1 0x0000fffeef8d5e98 in ?? () from target:/lib/libnvrm_host1x.so
#2 0x0000fffeefdec0ac in ?? () from target:/lib/libcuda.so.1
#3 0x0000fffeefca2de8 in ?? () from target:/lib/libcuda.so.1
#4 0x0000fffeefcbae34 in ?? () from target:/lib/libcuda.so.1
#5 0x0000fffeefd16f54 in ?? () from target:/lib/libcuda.so.1
#6 0x0000fffeefd220d8 in ?? () from target:/lib/libcuda.so.1
#7 0x0000fffeefd35298 in ?? () from target:/lib/libcuda.so.1
#8 0x0000ffff841b1568 in ?? () from target:/apollo/bazel-bin/modules/rs_perception/trafficlight/tfl_nn/component/…/…/…/…/…/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.11.0
#9 0x0000ffff8420c974 in cudaStreamSynchronize () from target:/apollo/bazel-bin/modules/rs_perception/trafficlight/tfl_nn/component/…/…/…/…/…/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.11.0

I have 8 streams running in parallel in 8 threads. All 8 threads hang at cudaStreamSynchronize(), and they all have the same backtrace.
How can I fix it?

Thanks.

Hi,

How long do you wait for the cudaStreamSynchronize() to finish?

By default, the CPU launches GPU tasks without waiting for them to finish.
After calling synchronize, the CPU is expected to block until all GPU tasks in the stream are done.
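
To tell a slow task apart from a real hang, one option is to poll with a deadline instead of blocking indefinitely. Below is a minimal sketch of that idea. `wait_with_timeout` is a hypothetical helper, not part of CUDA or TensorRT; it only needs the C++ standard library, and the CUDA usage in the trailing comment assumes a valid `stream` and the standard `cudaStreamQuery()` runtime call.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not a CUDA/TensorRT API).
// Polls `is_done` until it returns true or `timeout` elapses.
// Returns true if the work finished in time, false on timeout.
template <typename DoneFn>
bool wait_with_timeout(DoneFn is_done,
                       std::chrono::milliseconds timeout,
                       std::chrono::milliseconds poll_interval =
                           std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!is_done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // still pending: a good place to log which stream is stuck
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}

// Intended use with CUDA (assumes <cuda_runtime.h> and a valid `stream`):
//   bool finished = wait_with_timeout(
//       [&] { return cudaStreamQuery(stream) == cudaSuccess; },
//       std::chrono::seconds(5));
//   if (!finished) { /* report the hang instead of blocking forever */ }
```

With this pattern each inference thread can report which stream never completed, rather than all threads blocking silently in cudaStreamSynchronize().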

Thanks.

Hi AastaLLL,
I call cudaStreamSynchronize() after enqueueV3() to wait for the GPU to finish.
When it runs correctly, the result comes back in about 10-50 ms. In the failing case, it is still hung in the same place after 10 minutes of waiting.
Thanks.

Hi,

Could you share a sample and steps to reproduce this?
We want to check it further internally.

Thanks.

Hi,
Our test cases contain trade secrets, so it is difficult to provide a simple sample.
But I have some more information: there are 3 TensorRT networks running in these 8 streams. The 3 networks don't have this problem when they run in 3 separate processes, but the scheduling is very inefficient that way, so we merged them into a single process, and that is when the hang occurs.
Could you provide some debugging suggestions so that we can get more information?

Thanks.

Hi,

Please try to profile the application with Nsight Systems.
It should give you some hints about where the freeze happened.

Also, could you check whether there is any error message in dmesg?

$ sudo dmesg

Thanks

Hi,
I can only see the thread backtraces, and all GPU inference threads hang at cudaStreamSynchronize(). dmesg shows nothing when the problem occurs, because the program did not quit. I also don't know how to use nsys to get more information about this bug.
I ran some controlled tests: I reduced the number of inference threads, and when I remove any two of them, the problem doesn't occur.
Is there a limit on the number of streams for TensorRT on Jetson Orin?

Thanks

Hi,

Could you try increasing the number of CUDA work queues (connections) to see if it helps?

$ export CUDA_DEVICE_MAX_CONNECTIONS=32

For reference, CUDA_DEVICE_MAX_CONNECTIONS is described here:
https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_4
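
If exporting the variable in the shell is inconvenient (for example, when the process is started by a launcher), it can also be set from inside the process. This is a minimal sketch that assumes the variable is read when CUDA initializes, so the call has to happen before the first CUDA or TensorRT call in the process. `set_cuda_max_connections` is a hypothetical helper name, and `setenv` is the POSIX function available on Linux.

```cpp
#include <stdlib.h>  // POSIX setenv
#include <string>

// Hypothetical helper (not a CUDA API).
// Sets CUDA_DEVICE_MAX_CONNECTIONS for the current process.
// Assumption: this must run before the first CUDA call, because the
// variable is expected to be read at CUDA initialization.
// Returns true if the environment variable was set successfully.
bool set_cuda_max_connections(int connections) {
    const std::string value = std::to_string(connections);
    // The final argument (1) overwrites any value already in the environment.
    return setenv("CUDA_DEVICE_MAX_CONNECTIONS", value.c_str(), 1) == 0;
}

// Intended use, at the very top of main():
//   set_cuda_max_connections(32);
//   ... then create the TensorRT runtime, engines, and streams ...
```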

Thanks.