CUPTI blocks cudaLaunch in multithreaded code

I am using CUDA 4.1 with CUPTI on a Tesla C2070.

The code has two threads. The first thread launches a long-running kernel and waits in cudaDeviceSynchronize(). While it is waiting, the second thread launches a small kernel.
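A minimal sketch of this two-thread pattern, written from the description above and not taken from the attached file. The kernel names, the cycle count, and the one-second sleep are assumptions chosen only so that the second launch happens while the first thread is still inside cudaDeviceSynchronize(). It times how long the host-side launch call takes to return in the second thread.

```cuda
#include <cstdio>
#include <pthread.h>
#include <unistd.h>
#include <sys/time.h>
#include <cuda_runtime.h>

// Hypothetical long kernel: spin on the device clock for about `cycles` ticks.
__global__ void longKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical small kernel: a single store.
__global__ void smallKernel(int *out)
{
    *out = 1;
}

static double nowSeconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Thread 1: launch the long kernel, then block until it finishes.
static void *thread1(void *)
{
    longKernel<<<1, 1>>>(5000000000LL);  // assumed value, a few seconds of GPU time
    cudaDeviceSynchronize();
    return NULL;
}

// Thread 2: launch the small kernel while thread 1 is still waiting,
// and measure how long the launch call itself takes on the host.
static void *thread2(void *arg)
{
    int *dOut = static_cast<int *>(arg);
    sleep(1);  // assumed delay so thread 1 is already in cudaDeviceSynchronize()
    double t0 = nowSeconds();
    smallKernel<<<1, 1>>>(dOut);
    double t1 = nowSeconds();
    printf("host-side launch call took %.3f s\n", t1 - t0);
    return NULL;
}

int main()
{
    int *dOut = NULL;
    // Allocate up front so thread 2 makes no other CUDA call before its launch.
    cudaMalloc((void **)&dOut, sizeof(int));

    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, dOut);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    cudaDeviceSynchronize();
    cudaFree(dOut);
    return 0;
}
```

This can be built with something like `nvcc -o repro repro.cu -lpthread`. Since a kernel launch is asynchronous, the printed time would normally be expected to be very small; a value close to the remaining run time of the long kernel would indicate that the launch call itself was blocked.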

I have subscribed to the CUPTI_RUNTIME_TRACE_CBID_cudaConfigureCall_v3020 and CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020 callbacks.
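For reference, a subscription to these two callbacks might look roughly like the sketch below. It uses the CUPTI callback API (cuptiSubscribe, cuptiEnableCallback, cuptiUnsubscribe); the callback body, the name traceCallback, and the empty test kernel are assumptions for illustration and are not the code from the attachment. Newer toolkits may route `<<<...>>>` launches through a different runtime entry point, in which case these two callback IDs may never fire.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cupti.h>

// Hypothetical callback: print the name of each traced runtime call on
// entry and on exit.
static void CUPTIAPI traceCallback(void *userdata, CUpti_CallbackDomain domain,
                                   CUpti_CallbackId cbid, const void *cbdata)
{
    const CUpti_CallbackData *info =
        static_cast<const CUpti_CallbackData *>(cbdata);
    const char *site =
        (info->callbackSite == CUPTI_API_ENTER) ? "enter" : "exit";
    printf("%s %s\n", site, info->functionName);
}

// Hypothetical kernel used only to trigger a launch.
__global__ void emptyKernel() { }

int main()
{
    CUpti_SubscriberHandle subscriber;

    // Register the callback function with CUPTI.
    if (cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)traceCallback, NULL)
            != CUPTI_SUCCESS) {
        fprintf(stderr, "cuptiSubscribe failed\n");
        return 1;
    }

    // Enable only the two runtime API callbacks named in the post.
    cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                        CUPTI_RUNTIME_TRACE_CBID_cudaConfigureCall_v3020);
    cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                        CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020);

    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cuptiUnsubscribe(subscriber);
    return 0;
}
```

With the install layout visible in the backtrace, the build line would be along the lines of `nvcc trace.cu -I/usr/local/cuda/extras/CUPTI/include -L/usr/local/cuda/extras/CUPTI/lib64 -lcupti`.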

With these callbacks enabled, the launch of the second kernel blocks until the first thread has finished cudaDeviceSynchronize(). Specifically, the second thread does not return from cudaConfigureCall() until the first thread returns from cudaDeviceSynchronize().

If I do not subscribe with CUPTI, this does not happen. This looks like a nasty performance bug with CUPTI.

The call stacks below show the state of each thread. I have attached the code to this post.

(gdb) info threads
4 Thread 0x7f731467c710 (LWP 29708) 0x00000037f4ada083 in select () from /lib64/libc.so.6
3 Thread 0x7f7312b50710 (LWP 29709) 0x00007f7314d7e3a6 in ?? () from /usr/lib64/libcuda.so.1
2 Thread 0x7f731214f710 (LWP 29710) 0x00000037f4ac88d7 in sched_yield () from /lib64/libc.so.6
* 1 Thread 0x7f731477e720 (LWP 29707) 0x00000037f520803d in pthread_join () from /lib64/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f731214f710 (LWP 29710))]#0 0x00000037f4ac88d7 in sched_yield () from /lib64/libc.so.6
(gdb) bt
#0 0x00000037f4ac88d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007f73149fb73c in ?? () from /usr/local/cuda/extras/CUPTI/lib64/libcupti.so.4
#2 0x00007f7314dabac3 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f7314db1020 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007f73147bbee8 in cudaConfigureCall () from /usr/local/cuda/lib64/libcudart.so.4
#5 0x000000000040110f in Thread2 () at event_sampling.cu:121
#6 0x00000037f52077e1 in start_thread () from /lib64/libpthread.so.0
#7 0x00000037f4ae152d in clone () from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f7312b50710 (LWP 29709))]#0 0x00007f7314d7e3a6 in ?? () from /usr/lib64/libcuda.so.1
(gdb) bt
#0 0x00007f7314d7e3a6 in ?? () from /usr/lib64/libcuda.so.1
#1 0x00007f7314d36b5a in ?? () from /usr/lib64/libcuda.so.1
#2 0x00007f7314d08976 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f7314d396a3 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007f7314d39a06 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007f7314d08a29 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007f7314cfb830 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007f7314cdafa4 in ?? () from /usr/lib64/libcuda.so.1
#8 0x00007f731478ea13 in ?? () from /usr/local/cuda/lib64/libcudart.so.4
#9 0x00007f73147c3827 in cudaDeviceSynchronize () from /usr/local/cuda/lib64/libcudart.so.4
#10 0x0000000000400fe2 in Thread1 (ip=0x0) at event_sampling.cu:101
#11 0x00000037f52077e1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000037f4ae152d in clone () from /lib64/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f731467c710 (LWP 29708))]#0 0x00000037f4ada083 in select () from /lib64/libc.so.6
(gdb) bt
#0 0x00000037f4ada083 in select () from /lib64/libc.so.6
#1 0x00007f731524147b in ?? () from /usr/lib64/libcuda.so.1
#2 0x00007f7314d45d9b in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f7315242819 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00000037f52077e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000037f4ae152d in clone () from /lib64/libc.so.6
(gdb)

mt_launch_block.cu (4.57 KB)

Link to cross post on StackOverflow: