Segmentation fault in enqueue() when using multithreading

Hi,

I get a segmentation fault when two threads each run one DNN via TensorRT on Xavier (DDPX). The segfault happens when both threads reach IExecutionContext::enqueue() at exactly the same time.

Here’s the backtrace of the segfaulted thread:
#0 0x0000007f92cfe14c in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#1 0x0000007f92cfe618 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#2 0x0000007f92cfe744 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#3 0x0000007f9356c198 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#4 0x0000007f9356d924 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#5 0x0000007f931043f0 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#6 0x0000007f92b6e658 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#7 0x0000007f92b6ea38 in ?? () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#8 0x0000007f92b6f334 in cudnnConvolutionForward () from /usr/lib/aarch64-linux-gnu/libcudnn.so.7
#9 0x0000007fb23c9ec4 in nvinfer1::rt::cuda::CudnnConvolutionRunner::execute(nvinfer1::rt::CommonContext const&, nvinfer1::rt::ExecutionParameters const&) const () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.6
#10 0x0000007fb21bdb58 in nvinfer1::rt::ExecutionContext::enqueueInternal(CUevent_st**) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.6
#11 0x0000007fb21c02d0 in nvinfer1::rt::ExecutionContext::enqueue(int, void**, CUstream_st*, CUevent_st**) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.6

The backtrace of the other thread varies in content. For example:
#0 0x0000007f892cfab4 in ?? () from /usr/lib/libcuda.so.1
#1 0x0000007f892f7188 in ?? () from /usr/lib/libcuda.so.1
#2 0x0000007f892f7440 in ?? () from /usr/lib/libcuda.so.1
#3 0x0000007f8944ef00 in ?? () from /usr/lib/libcuda.so.1
#4 0x0000007f8926e0b8 in ?? () from /usr/lib/libcuda.so.1
#5 0x0000007f8926e280 in ?? () from /usr/lib/libcuda.so.1
#6 0x0000007f8937fefc in cuLaunchKernel () from /usr/lib/libcuda.so.1
#7 0x0000007f8dc782ac in ?? () from /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcudart.so.10.2
#8 0x0000000000000100 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The segfault happens only intermittently, sometimes in one thread and sometimes in the other. I.e. it’s not always the same thread that segfaults.

For the record, I have followed these guidelines about thread-safety:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety

I have:

  • Ensured that each thread creates its own execution context (via engine_->createExecutionContext()).
  • Ensured the logger is a global variable and is thread-safe.
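
To make the second point concrete, here is a minimal sketch of what "thread-safe logger" is taken to mean here. SafeLogger, worker and runTwoWorkers are hypothetical names used only for illustration, and the class is a standard-library stand-in: in a real application it would presumably derive from nvinfer1::ILogger and override its log() method, which is not shown.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a TensorRT logger (assumption: the real class
// would derive from nvinfer1::ILogger; that dependency is left out here).
class SafeLogger {
public:
    // May be called from several threads at once; the mutex serialises
    // access to the shared message buffer.
    void log(const std::string& msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        messages_.push_back(msg);
    }

    // Number of messages recorded so far.
    std::size_t count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return messages_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::string> messages_;
};

// One process-wide logger instance shared by all threads.
SafeLogger g_logger;

// Hypothetical worker: logs n messages tagged with its thread id.
void worker(int id, int n) {
    for (int i = 0; i < n; ++i) {
        g_logger.log("thread " + std::to_string(id) +
                     " message " + std::to_string(i));
    }
}

// Runs two workers concurrently and returns the total number of messages.
std::size_t runTwoWorkers(int n) {
    std::thread a(worker, 0, n);
    std::thread b(worker, 1, n);
    a.join();
    b.join();
    return g_logger.count();
}
```

If the logger were not guarded like this, concurrent log() calls could corrupt the buffer, so it is worth ruling out before looking deeper into the cuDNN frames.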

What is the problem? How can I debug this further?

I currently cannot share a reproducible example since 1) it’s not perfectly reproducible and 2) it’s proprietary.

Thanks!


Could you please check whether you are using a unique CUDA stream on each thread, or sharing the same stream? If two host threads share the same stream, that might cause a race condition.
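
One way to test the race-condition hypothesis is to temporarily serialise the calls behind a process-wide mutex: if the segfault disappears, concurrent entry into the call is implicated. The sketch below is only a diagnostic aid, not a fix. All names in it (g_enqueueMutex, serialized, fakeEnqueue, maxOverlapWithLock) are hypothetical, and fakeEnqueue is a stand-in for the real context->enqueue(...) call, which is not reproduced here.

```cpp
#include <mutex>
#include <thread>

// Process-wide lock, intended to be used only while debugging.
std::mutex g_enqueueMutex;

// Runs fn while holding the debug lock, so that no two threads are inside
// fn at the same time. Returns whatever fn returns.
template <typename Fn>
auto serialized(Fn&& fn) -> decltype(fn()) {
    std::lock_guard<std::mutex> lock(g_enqueueMutex);
    return fn();
}

// Counters touched only while g_enqueueMutex is held.
int g_inside = 0;     // threads currently inside fakeEnqueue
int g_maxInside = 0;  // highest overlap observed

// Hypothetical stand-in for the real enqueue call (assumption: in the actual
// application this would be context->enqueue(...) on the thread's own context).
bool fakeEnqueue() {
    ++g_inside;
    if (g_inside > g_maxInside) g_maxInside = g_inside;
    std::this_thread::yield();  // give the other thread a chance to run
    --g_inside;
    return true;
}

// Two threads call the wrapped function repeatedly; returns the maximum
// number of threads that were ever inside it simultaneously.
int maxOverlapWithLock(int iterations) {
    g_inside = 0;
    g_maxInside = 0;
    auto loop = [iterations] {
        for (int i = 0; i < iterations; ++i) {
            serialized([] { return fakeEnqueue(); });
        }
    };
    std::thread a(loop);
    std::thread b(loop);
    a.join();
    b.join();
    return g_maxInside;
}
```

With the lock in place the two threads never overlap inside the wrapped call. This costs concurrency, so it is meant for narrowing down the problem rather than for production use.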

If possible, could you share sample code that reproduces the issue, so we can help better?

Thanks

Hi,

Thanks! I am indeed using a different CUDA stream on each thread, created via cudaStreamCreate. I’ve also tried default creation, creation as non-blocking with respect to the default stream, and creation with different priorities (one stream with high priority, the other with low priority). None of these helped.

I’ll try and see if I can reproduce it with trtexec.

Thanks!


Hi guys,

Are there any updates? I observe the same problem on Windows with a 2080 Ti.
