Setting a breakpoint on CUDA runtime API function

Hello! Say I want to find out the call stack for a particular runtime API function. How would I go about that? In my particular case, I tried setting a breakpoint in cuda-gdb on cudaStreamSynchronize:

b cudaStreamSynchronize

Function “cudaStreamSynchronize” not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (cudaStreamSynchronize) pending.


Breakpoint never hits.

Any suggestions?


Set a breakpoint on the particular line of code in your file that calls cudaStreamSynchronize?
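For example, assuming the call site is at line 42 of a file named main.cu (both the filename and line number are placeholders here), that would look something like:

```
(gdb) break main.cu:42
(gdb) run
(gdb) bt
```

The `bt` at the breakpoint then gives you the call stack leading to that particular call site.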

By the way, I tried your method just now using gdb and cudaDeviceSynchronize, on an application with cudart statically linked, and it worked just fine:

$ gdb ./t1
GNU gdb (GDB) Fedora 8.0.1-36.fc27
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./t1...done.
(gdb) b cudaDeviceSynchronize
Breakpoint 1 at 0x434db0
(gdb) r
Starting program: /home/bob/misc/t1
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26-15.fc27.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/".
[New Thread 0x7ffff5909700 (LWP 7548)]
[New Thread 0x7ffff5108700 (LWP 7549)]
[New Thread 0x7ffff4907700 (LWP 7550)]

Thread 1 "t1" hit Breakpoint 1, 0x0000000000434db0 in cudaDeviceSynchronize ()
Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.3.1-6.fc27.x86_64 libstdc++-7.3.1-6.fc27.x86_64
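To actually capture the call stack at each hit (the original goal), you can attach commands to the breakpoint so it prints a backtrace and continues automatically. A sketch, continuing a session like the one above:

```
(gdb) commands 1
>bt
>continue
>end
(gdb) run
```

With this, every hit of breakpoint 1 prints the full backtrace without stopping the program for manual inspection.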

Thank you txbob. Say I have a dynamically linked .so (in my case a potentially unknown/third-party CUDA extension to PyTorch) that gets called from inside Python, and no access to the source of some of the intermediate .so files (or, in some cases, no time to rebuild them, since it can take a long time to figure out the right build setup for a complex library). Also, with multiple places that call cudaDeviceSynchronize, presumably I'd have to write a wrapper and instrument the code in multiple places to call the wrapper instead, and ongoing maintenance of that kind of code injection doesn't scale well in terms of time spent. So some kind of non-intrusive solution with breakpoints on the CUDA runtime API would be really nice.
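A non-intrusive sketch along those lines (my_script.py is a placeholder, and this assumes the extension links libcudart.so dynamically so the symbol is visible to the debugger; with a statically linked, stripped cudart the symbol may not be exported, which is one reason a pending breakpoint can fail to ever resolve): start the Python interpreter under the debugger, set a pending breakpoint on the runtime API function, and attach a backtrace command to it:

```
$ cuda-gdb --args python my_script.py
(gdb) set breakpoint pending on
(gdb) b cudaDeviceSynchronize
(gdb) commands
>bt
>continue
>end
(gdb) run
```

Since the breakpoint lives in the debugger rather than in the code, this covers every call site in every loaded library at once, with no rebuilds and nothing to maintain in the sources.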