CUPTI deadlock at cuptiPCSamplingStop

I tried to do remote GPU profiling using CUPTI and gRPC. The code looks like this:

class GPUProfilingServerImpl final : public GPUProfilingServer::Service {
    Status DoProfiling(ServerContext* context, const GPUProfilingRequest* request, GPUProfilingResponse* reply) override {
        // initialize param
        // initialize param
        return Status::OK;
    }
};

extern "C" int InitializeInjection(void) {
    // enable cupti callbacks
    // start grpc server
    return 1;
}

I compiled the code into a dynamic library and set CUDA_INJECTION64_PATH to the library path. Then I ran a CUDA program and issued a request from a gRPC client, and a deadlock occasionally happened. The gdb backtrace of the blocked thread was as follows:

#0  __lll_lock_wait (futex=futex@entry=0x55718480bdc8, private=0) at lowlevellock.c:52
#1  0x00007f16bdbcd131 in __GI___pthread_mutex_lock (mutex=0x55718480bdc8) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007f16ba15b292 in ?? () from /usr/local/cuda/lib64/
#3  0x00007f16ba03a746 in ?? () from /usr/local/cuda/lib64/
#4  0x00007f16ba03aa50 in ?? () from /usr/local/cuda/lib64/
#5  0x00007f16ba03b48c in ?? () from /usr/local/cuda/lib64/
#6  0x00007f16bc011495 in ?? () from /lib/x86_64-linux-gnu/
#7  0x00007f16bc21c4a0 in ?? () from /lib/x86_64-linux-gnu/
#8  0x00007f16bbfb528f in ?? () from /lib/x86_64-linux-gnu/
#9  0x00007f16bbfb799f in ?? () from /lib/x86_64-linux-gnu/
#10 0x00007f16bc0591c2 in ?? () from /lib/x86_64-linux-gnu/
#11 0x000055718388712b in __cudart803 ()
#12 0x00005571838e2006 in cudaLaunchKernel ()

I checked the owner of the mutex:

(gdb) p *mutex
$1 = {__data = {__lock = 2, __count = 1, __owner = 2937977, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\001\000\000\000y\324,\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
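As a cross-check on the owner thread ID, the raw `__size` bytes in the gdb output can be decoded by hand. The sketch below assumes the glibc x86_64 layout suggested by the field order gdb prints (`__lock` at byte 0, `__count` at 4, `__owner` at 8, `__nusers` at 12) and little-endian byte order; the helper names `read_le32` and `mutex_owner_tid` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// First 17 bytes of the `__size` string printed by gdb; the rest are zero.
// In the escaped string, 'y' is 0x79, '\324' is 0xD4 and ',' is 0x2C.
static const unsigned char kMutexBytes[17] = {
    0x02, 0x00, 0x00, 0x00,  // __lock   (assumed offset 0)
    0x01, 0x00, 0x00, 0x00,  // __count  (assumed offset 4)
    0x79, 0xD4, 0x2C, 0x00,  // __owner  (assumed offset 8)
    0x01, 0x00, 0x00, 0x00,  // __nusers (assumed offset 12)
    0x01                     // first byte of __kind
};

// Assemble a little-endian 32-bit value starting at `offset`.
std::uint32_t read_le32(const unsigned char* bytes, std::size_t offset) {
    return static_cast<std::uint32_t>(bytes[offset]) |
           (static_cast<std::uint32_t>(bytes[offset + 1]) << 8) |
           (static_cast<std::uint32_t>(bytes[offset + 2]) << 16) |
           (static_cast<std::uint32_t>(bytes[offset + 3]) << 24);
}

// Thread ID of the mutex owner, read from the assumed __owner offset.
std::uint32_t mutex_owner_tid() {
    return read_le32(kMutexBytes, 8);
}
```

Calling `mutex_owner_tid()` gives 2937977 (0x2CD479), which agrees with the `__owner` field above.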

Then I printed the call stack of the owning thread, 2937977:

#0  0x00007f1690bd910d in ?? () from /usr/local/cuda/lib64/
#1  0x00007f1690a104e9 in ?? () from /usr/local/cuda/lib64/
#2  0x00007f16ba15c01d in ?? () from /usr/local/cuda/lib64/
#3  0x00007f16ba15ff9e in ?? () from /usr/local/cuda/lib64/
#4  0x00007f16ba16034d in ?? () from /usr/local/cuda/lib64/
#5  0x00007f16ba15aca4 in cuptiPCSamplingStop () from /usr/local/cuda/lib64/
#6  0x00007f16bb1ba0cd in stopCUptiPCSamplingHandler (signum=12) at gpu_profiler.cpp:875
#7  0x00007f16bb1bcc54 in GPUProfilingServiceImpl::DoProfiling (this=0x7f16b94225b0, context=0x7f16a0010248, request=0x7f16a000f3a0, reply=0x7f16b27f9380) at gpu_profiler.cpp:915

Does anyone know why? Thanks.

FYI, I found that there were CUPTI calls in the call stack of cudaLaunchKernel, which might be the cause of the deadlock. So I tried disabling the CUPTI callbacks before calling cuptiPCSamplingStart/Stop and re-enabling them afterwards, but the behavior did not change.
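For readers following along, here is a minimal standard C++ model of the pattern the two stacks suggest: one thread owns a mutex while waiting for progress, and the other thread needs that same mutex to make the progress. It is only a sketch. It does not call CUPTI, the names (`simulate_circular_wait`, `shared_lock`, `Outcome`) are hypothetical, and whether cuptiPCSamplingStop really waits on the launching thread is an assumption; the backtraces only show that it owns the mutex cudaLaunchKernel is blocked on. A timed lock is used so that the model reports the stall instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Result of one simulated run.
struct Outcome {
    bool launch_completed;  // did the "launcher" ever acquire the lock?
    bool stop_completed;    // did the "stopper" see the launch it waited for?
};

// Hypothetical model: `shared_lock` stands in for the mutex from the gdb output.
Outcome simulate_circular_wait(std::chrono::milliseconds launch_timeout) {
    std::timed_mutex shared_lock;
    std::atomic<bool> stopper_holds_lock{false};
    std::atomic<bool> launch_done{false};
    std::atomic<bool> launcher_gave_up{false};
    Outcome outcome{false, false};

    // Plays the role of the thread inside cuptiPCSamplingStop (assumed behavior).
    std::thread stopper([&] {
        std::lock_guard<std::timed_mutex> guard(shared_lock);
        stopper_holds_lock = true;
        // Waits, with the lock held, for progress only the launcher can make.
        while (!launch_done && !launcher_gave_up) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        outcome.stop_completed = launch_done;
    });

    // Plays the role of the thread inside cudaLaunchKernel.
    std::thread launcher([&] {
        while (!stopper_holds_lock) {
            std::this_thread::yield();
        }
        if (shared_lock.try_lock_for(launch_timeout)) {
            launch_done = true;
            shared_lock.unlock();
            outcome.launch_completed = true;
        } else {
            // With a plain blocking lock this thread would hang forever here.
            launcher_gave_up = true;
        }
    });

    stopper.join();
    launcher.join();
    return outcome;
}
```

With any timeout, for example `simulate_circular_wait(std::chrono::milliseconds(50))`, neither side completes in this model: the launcher times out on the lock, and the stopper never sees the launch finish. With an ordinary blocking lock both threads would simply hang, as in the report.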

Hi pkueecsly,

A similar deadlock in the cuptiPCSamplingStop call was fixed in the CUDA 11.6 Update 1 release (link). Could you try the CUPTI from this release?

And thanks for providing the call stack and other relevant details.

Thanks for your reply.
But I checked the installed CUDA version, and it is already 11.6.1 (driver version 510.47.03).

Hi pkueecsly,

Can you please tell us the CUPTI library version? By default, the library is located at /usr/local/cuda/extras/CUPTI/lib64. Is version or

Sorry for the late reply.
I checked the CUPTI lib version; it was 2022.1.0. But after I updated it to, the bug still existed.

Hi pkueecsly,

Sorry to hear that the issue is not fixed in the CUPTI from the CUDA 11.6 Update 1 release. Would it be possible for you to provide a minimal reproducer so that we can debug the issue? Also, which GPU are you using?