CUPTI activity tracer hangs in cuptiActivityFlushAll when tracing PyTorch models

I have a CUPTI tracer that uses the Activity API. It works well with CUDA/TensorFlow workloads, however it hangs when I try to trace a PyTorch model. The hang occurs in the call to cuptiActivityFlushAll.

Here is where it seems to hang:

* frame #0: 0x00007fff052b96e6 libpthread.so.0`do_futex_wait at futex-internal.h:205
    frame #1: 0x00007fff052b96be libpthread.so.0`do_futex_wait(sem=0x000000000addd940, abstime=0x0000000000000000) at sem_waitcommon.c:111
    frame #2: 0x00007fff052b97d8 libpthread.so.0`__new_sem_wait_slow(sem=0x000000000addd940, abstime=0x0000000000000000) at sem_waitcommon.c:181
    frame #3: 0x00007fff01afd88f libcupti.so`___lldb_unnamed_symbol3808$$libcupti.so + 239
    frame #4: 0x00007fff0197632e libcupti.so`cuptiActivityFlushAll + 526
......

Using a debugger, I can see another thread waiting here:

frame #0: 0x00007fff052b69f3 libpthread.so.0`__pthread_cond_wait at futex-internal.h:88
    frame #1: 0x00007fff052b69d8 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:502
    frame #2: 0x00007fff052b68f8 libpthread.so.0`__pthread_cond_wait(cond=0x00000000170b0850, mutex=0x00000000170b0828) at pthread_cond_wait.c:655
    frame #3: 0x00007fff008afbdd libcuda.so.1`___lldb_unnamed_symbol3770$libcuda.so.1 + 253
    frame #4: 0x00007fff0084e653 libcuda.so.1`___lldb_unnamed_symbol2219$libcuda.so.1 + 147
    frame #5: 0x00007fff008aee18 libcuda.so.1`___lldb_unnamed_symbol3727$libcuda.so.1 + 40
    frame #6: 0x00007fff052b06db libpthread.so.0`start_thread(arg=0x00007ffef4b48700) at pthread_create.c:463
    frame #7: 0x00007fff0482fb2f libc.so.6`__GI___clone at clone.S:95

Is there anything that might be causing this? I’d love some pointers to help me debug it. The tracer and the GPU workload run in the same process.

Thank you,
Sujan

Hi Sujan,

Which CUDA toolkit and GPU are you using? If you are on an older toolkit, please check whether the issue reproduces on a recent toolkit such as CUDA 11.0 or 10.2. Another experiment worth trying is a forced flush: pass 1 as the flag to cuptiActivityFlushAll.
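A minimal sketch of that forced flush, assuming a C++ tracer that already has CUPTI initialized (the helper name here is only illustrative):

    #include <cupti.h>

    // Forced flush: CUPTI returns records from all activity buffers, including
    // ones that are not yet full. CUPTI_ACTIVITY_FLAG_FLUSH_FORCED is the named
    // constant for the flag value 1.
    static CUptiResult forceFlushActivityRecords() {
      return cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);
    }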

It’s difficult to identify the issue from the call stack alone. Would it be possible for you to provide a minimal reproducer?

It turns out I had a bug that freed the CUPTI activity buffer before CUPTI handed it back through the buffer_returned (buffer-completed) callback.
That caused cuptiActivityFlushAll to hang.
Fixing the bug fixed the issue.
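For anyone hitting the same hang, here is a minimal sketch of the buffer ownership pattern that matters, using the standard CUPTI Activity API callbacks (buffer size, activity kind, and the workload are placeholders): the buffer allocated in the request callback belongs to CUPTI until it is handed back in the completion callback, and it must only be freed there.

    #include <cstdio>
    #include <cstdlib>
    #include <cupti.h>

    // CUPTI asks for an empty buffer here; ownership of the allocation passes to CUPTI.
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                         size_t *maxNumRecords) {
      *size = 8 * 1024 * 1024;                         // 8 MB activity buffer
      *buffer = static_cast<uint8_t *>(malloc(*size));
      *maxNumRecords = 0;                              // no record-count limit
    }

    // CUPTI hands the filled buffer back here; only now is it safe to read and free it.
    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size,
                                         size_t validSize) {
      CUpti_Activity *record = nullptr;
      while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        printf("activity kind: %d\n", record->kind);
      }
      free(buffer);  // freeing the buffer any earlier than this can deadlock the flush
    }

    int main() {
      cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
      cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

      // ... run the GPU workload (PyTorch model, CUDA kernels, etc.) ...

      // With the buffer freed only in bufferCompleted, this no longer hangs.
      cuptiActivityFlushAll(0);
      return 0;
    }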