Multiple Celery Workers: cuStreamSynchronize failed: an illegal memory access was encountered

Hi, I’m currently developing a distributed inference service with Python Celery and TensorRT. I’m hitting an intermittent illegal memory access (cuStreamSynchronize failed: an illegal memory access was encountered) when multiple workers are spawned, each in its own Docker container, and each worker initializes a TensorRT model. I ran cuda-memcheck on the code and the error appears to come from BatchedNMSDynamicPlugin. The cuda-memcheck log is attached below, followed by a rough sketch of the worker-side inference path. Any help is appreciated.

========= CUDA-MEMCHECK
========= Invalid shared write of size 2
========= at 0x00006a80 in void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>(cub::DeviceRadixSortPolicy<__half, int, int>::Policy700 const *, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const , int>, bool=1 const *, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>, bool=1, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>, int, int, int)
========= by thread (128,0,0) in block (79,0,0)
========= Address 0x0001c970 is out of bounds
========= Device Frame:void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>(cub::DeviceRadixSortPolicy<__half, int, int>::Policy700 const *, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const , int>, bool=1 const *, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>, bool=1, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>, int, int, int) (void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *, int>(cub::DeviceRadixSortPolicy<__half, int, int>::Policy700 const *, cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<__half, int, int>::Policy700, bool=1, bool=1, __half, int, int const *,
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/local/cuda/compat/lib/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x222728]
========= Host Frame:/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 [0x1102b]
========= Host Frame:/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 (cudaLaunchKernel + 0x1c0) [0x5a820]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7 [0xabcea]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7 [0xb0de7]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7 [0xb1e31]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7 [0x8e6b1]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7 (_ZN8nvinfer16plugin23BatchedNMSDynamicPlugin7enqueueEPKNS_16PluginTensorDescES4_PKPKvPKPvS9_P11CUstream_st + 0x70) [0x43820]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer.so.7 (_ZNK8nvinfer12rt4cuda24PluginV2DynamicExtRunner7executeERKNS0_13CommonContextERKNS0_19ExecutionParametersE + 0x332) [0xc076f2]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer.so.7 (_ZN8nvinfer12rt16ExecutionContext15enqueueInternalEPP10CUevent_st + 0x4bc) [0xb8af3c]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvinfer.so.7 (_ZN8nvinfer12rt16ExecutionContext7enqueueEiPPvP11CUstream_stPP10CUevent_st + 0x3a4) [0xb8d584]
========= Host Frame:/usr/local/lib/python3.8/dist-packages/tensorrt/tensorrt.so [0x74666]
========= Host Frame:/usr/local/lib/python3.8/dist-packages/tensorrt/tensorrt.so [0xc6fcd]
========= Host Frame:/usr/bin/python (PyCFunction_Call + 0x59) [0x1f4249]
========= Host Frame:/usr/bin/python (_PyObject_MakeTpCall + 0x296) [0x1f46d6]
========= Host Frame:/usr/bin/python [0x10a9a3]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x1901) [0x16c451]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x849) [0x16b399]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x849) [0x16b399]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x849) [0x16b399]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python [0x19c95d]
========= Host Frame:/usr/bin/python (_PyObject_MakeTpCall + 0x1ff) [0x1f463f]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x5969) [0x1704b9]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python [0x10a24c]
========= Host Frame:/usr/bin/python (PyObject_Call + 0x62) [0x1f3d42]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x1f42) [0x16ca92]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x71e) [0x16b26e]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x1b6) [0x1f7146]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x71e) [0x16b26e]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x849) [0x16b399]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x5736) [0x170286]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
========= Host Frame:/usr/bin/python (_PyFunction_Vectorcall + 0x393) [0x1f7323]
========= Host Frame:/usr/bin/python [0x10a769]
========= Host Frame:/usr/bin/python (PyObject_Call + 0x62) [0x1f3d42]
========= Host Frame:/usr/bin/python (_PyEval_EvalFrameDefault + 0x1f42) [0x16ca92]
========= Host Frame:/usr/bin/python (_PyEval_EvalCodeWithName + 0x26a) [0x16955a]
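For reference, the worker-side inference path looks roughly like the sketch below. This is a simplified stand-in for the actual task code: the engine path, the buffer handling, and the assumption of static binding shapes are placeholders, not the real implementation (the real engine uses dynamic shapes via BatchedNMSDynamicPlugin).

```python
import numpy as np
import pycuda.autoinit  # creates this worker process's CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path):
    # Deserialize a prebuilt TensorRT engine inside the worker process.
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("model.engine")        # placeholder path
context = engine.create_execution_context()
stream = cuda.Stream()

def infer(host_input):
    # Allocate device buffers for every binding and copy the input over.
    # A dynamic-shape engine would need context.set_binding_shape() first;
    # static shapes are assumed here to keep the sketch short.
    bindings, outputs = [], []
    for name in engine:
        idx = engine.get_binding_index(name)
        shape = context.get_binding_shape(idx)
        dtype = trt.nptype(engine.get_binding_dtype(name))
        host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
        dev_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(dev_mem))
        if engine.binding_is_input(name):
            np.copyto(host_mem, host_input.ravel())
            cuda.memcpy_htod_async(dev_mem, host_mem, stream)
        else:
            outputs.append((host_mem, dev_mem))

    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host_mem, dev_mem in outputs:
        cuda.memcpy_dtoh_async(host_mem, dev_mem, stream)
    stream.synchronize()  # this is the call where the illegal access surfaces
    return [host_mem for host_mem, _ in outputs]
```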

Environment

**TensorRT Version**: 7.2.2
**GPU Type**: V100
**Nvidia Driver Version**: 460.27.04
**CUDA Version**: 11.2
**CUDNN Version**: -
**Operating System + Version**: RHEL 7.9
**Python Version (if applicable)**: Python 3.8
**TensorFlow Version (if applicable)**: -
**PyTorch Version (if applicable)**: -
**Baremetal or Container (if container which image + tag)**: nvcr.io/nvidia/tensorrt:21.02-py3

Hi,
The links below might be useful for you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.
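As a rough sketch of what the thread-safety guidance implies for a Celery deployment (the app name, engine path, and signal wiring below are illustrative assumptions, not taken from your code): create the runtime, engine, execution context, and CUDA stream inside each worker process after it has started, and keep one execution context plus one stream per process rather than sharing any of them.

```python
from celery import Celery
from celery.signals import worker_process_init

app = Celery("inference")  # placeholder app name/config

_engine = None
_context = None
_stream = None

@worker_process_init.connect
def init_tensorrt(**kwargs):
    # Initialize CUDA and TensorRT only inside the worker process, after it
    # has been forked/started, so no CUDA state is shared across processes.
    global _engine, _context, _stream
    import pycuda.autoinit            # this process's CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
        _engine = runtime.deserialize_cuda_engine(f.read())
    _context = _engine.create_execution_context()  # one context per process
    _stream = cuda.Stream()                        # one stream per context
```

The same idea applies to threads: an ICudaEngine can be shared, but each thread needs its own IExecutionContext and its own stream, per the thread-safety note linked above.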

Thanks!