pthread_rwlock_rdlock takes a very long time during TRT ExecutionContext::enqueue

Hi, I am profiling my TensorRT engine with Nsight Systems and I noticed latency spikes that occur at random. Most of these spikes contain a very long pthread_rwlock_rdlock call in the OS runtime libraries trace, which causes the GPU to go idle for a while.

For example:

I am using multiple processes with multiple GPUs, and I have process locks to ensure that processes do not call any CUDA API at the same time on a single GPU. However, I still see the long read lock shown above, which causes a large increase in latency (21 ms compared to 13.8 ms for the rest).

May I know the reason and how can I solve this problem?

Hi team Nvidia, any insights on this?

Hi @zukko,

The pthread_rwlock_rdlock stall you’re seeing in Nsight Systems is happening inside the NVIDIA CUDA driver itself, not in your application code. The CUDA driver uses internal read-write locks to protect its context state. A kernel launch acquires a read-lock, but it gets blocked when another thread or driver operation holds a write-lock.

Most Likely Cause: Lazy CUDA Module Loading

The most common cause of random, intermittent spikes exactly like yours is lazy kernel module loading. By default, CUDA defers loading kernel modules until first use. When a new module loads mid-inference, the driver acquires a write-lock, stalling all concurrent kernel launches until it’s done.

Try this first: set this variable in your environment before launching your application:

# Add to your ~/.bashrc or set in the shell before running your executable
export CUDA_MODULE_LOADING=EAGER

Then reload your shell or run source ~/.bashrc. This forces all modules to load upfront at context initialization, eliminating mid-inference write-lock surprises. This is the most targeted fix for the pattern you’re describing.
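If your worker processes are started from a launcher script rather than an interactive shell, the variable can also be passed explicitly to each child process. The snippet below is a minimal sketch of that idea using only the Python standard library. The helper names eager_env and launch are hypothetical, and the child command is a stand-in that only prints the variable it inherited. The variable has to be present before the process initializes CUDA, which is why it is set on the child's environment at launch.

```python
import os
import subprocess
import sys


def eager_env(base=None):
    """Return a copy of the environment that requests eager module loading."""
    env = dict(os.environ if base is None else base)
    env["CUDA_MODULE_LOADING"] = "EAGER"
    return env


def launch(cmd):
    """Run `cmd` (a list of arguments) with the modified environment."""
    return subprocess.run(
        cmd, env=eager_env(), check=True, capture_output=True, text=True
    )


if __name__ == "__main__":
    # Stand-in child: replace with the real inference command for each worker.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('CUDA_MODULE_LOADING'))",
    ]
    print(launch(child).stdout.strip())
```

Passing the environment per child keeps the setting out of ~/.bashrc, which helps when the workers are started by a service manager that never reads the shell profile.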

Other Write-Lock Triggers to Investigate

If the spike persists, look for these during the 21ms window in Nsight Systems:

  • Dynamic allocations: Any cudaMalloc, cudaFree, or cudaHostAlloc call during inference triggers a driver write-lock. All buffers must be pre-allocated before the inference loop.
  • Cold-start JIT compilation: TensorRT may JIT-compile certain kernels on the first few enqueue calls. Always warm up your engine with a few dummy runs before measuring latency.
  • Cross-process driver synchronization: In your multi-process setup, even with per-GPU process isolation, the NVIDIA kernel module is shared across all processes. Driver-level operations in one process can still create write-lock contention visible in another.

Diagnostic tip: In Nsight Systems, during the 21ms spike, check the other processes/threads in the timeline. If you see any memory operation or module activity elsewhere at that exact moment, that’s your write-lock holder.
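To know which iterations to look at in the timeline, it can help to log per-iteration latency and flag the outliers. The following is a rough sketch, not a definitive benchmark harness: run_once is a hypothetical callable standing in for one full inference (it is assumed to block until the GPU work has finished, otherwise only the enqueue time is measured), and the 1.3x-median threshold is an arbitrary choice for illustration.

```python
import statistics
import time


def time_loop(run_once, iterations, warmup=20):
    """Call run_once `warmup` times untimed, then time `iterations` calls (ms)."""
    for _ in range(warmup):
        run_once()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()  # assumed to return only after the GPU work is complete
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def find_spikes(latencies_ms, factor=1.3):
    """Return (iteration index, latency) pairs above factor * median latency."""
    median = statistics.median(latencies_ms)
    return [(i, t) for i, t in enumerate(latencies_ms) if t > factor * median]


if __name__ == "__main__":
    # Placeholder workload; swap in the real enqueue-and-synchronize call.
    latencies = time_loop(lambda: time.sleep(0.001), iterations=50, warmup=5)
    for index, latency in find_spikes(latencies):
        print(f"iteration {index}: {latency:.2f} ms")
```

With the numbers from this thread, a 21 ms iteration among 13.8 ms iterations would be flagged, and its index tells you where to zoom in on the Nsight Systems timeline.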

Longer-Term: Consider NVIDIA MPS

If you have multiple processes sharing a single GPU, Multi-Process Service (MPS) is the correct architectural solution. It merges multiple processes into a single CUDA context, eliminating context-switch overhead and reducing driver-level lock contention significantly. However, if your processes are already strictly one-process-per-GPU, MPS is less impactful here. Focus on CUDA_MODULE_LOADING=EAGER first.

Recommended Action Order

  1. Set CUDA_MODULE_LOADING=EAGER in your shell environment and re-profile
  2. Audit your inference loop for any dynamic CUDA allocations
  3. Add engine warmup runs before benchmarking
  4. Use Nsight Systems to identify what’s holding the write-lock during the spike

Let us know what Nsight shows during the spike. Happy to dig deeper!

Hey @athkumar

Thank you so much for the reply. I have a follow-up question. You mentioned that CUDA by default defers loading kernel modules until first use, but the random spikes I have seen so far happen in the middle of the inference loop. I already have a warm-up of 20 iterations, so I wonder whether export CUDA_MODULE_LOADING=EAGER can still solve this issue?