CUPTI Activity API giving asynchronous events with bogus(?) timestamps

I am trying to use TAU on a C++ application (built on Magma) that uses multiple threads to submit kernels to multiple streams. TAU uses the CUPTI Activity API to trace events on the GPU. After 2 weeks of debugging, I have determined that the CUPTI Activity API is returning events with bogus(?) timestamps for this code.

I created a simple tracing tool using the code in the activity_trace_async example. I basically took the cpp file from that example and built a shared object library out of it, and added a static global constructor so that I could LD_PRELOAD the library and run the application I want to trace. The library can be found here:

At any rate, after running for a while, the application spits out the following kinds of events. You can see that the correlation ID is monotonically increasing, but the timestamp has somehow gone backwards in time:

MEMCPY HtoD [ 31865754446 - 31865755630 ] device 0, context 1, stream 33, correlation 287972/r0
KERNEL "" [ 31865770254 - 31865774158 ] device 0, context 1, stream 33, correlation 287981
grid [1,1,1], block [16,16,1], shared memory (static 0, dynamic 0)
MEMCPY DtoH [ 28817954356 - 28817959060 ] device 0, context 1, stream 33, correlation 287982/r0

From that point on, the timestamps continue monotonically increasing… until at some later point they jump backwards again. This seems to happen on all(?) streams.

Is this a bug in CUPTI? Is something else going on? I see 'backwards' jumps of various sizes, from a few seconds up to 20+ seconds.

nvprof has no problem generating a valid .nvvp file for this code, so there must be a CUPTI trick/setting that we are missing (one that is also missing from your activity_trace_async example)…?

Hi khuck, could you please try it with an increased activity buffer size (BUF_SIZE = 16 or 32 MB) and/or an increased device buffer (32 MB) and profiling semaphore pool (512K), controlled using the attributes below:
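A sketch of applying the device-side settings, assuming the CUPTI activity attribute API (`cuptiActivitySetAttribute`). The two attribute names are my reading of "device buffer" and "profiling semaphore pool"; please check them against the CUPTI headers for your toolkit version. (BUF_SIZE itself is the host-side buffer size macro in the activity_trace_async sample and is changed there directly.)

```cpp
#include <cupti.h>
#include <cstddef>

// Raise CUPTI's device buffer and profiling semaphore pool sizes.
// Call this before enabling activity kinds.
void configure_cupti_buffers() {
    // 32 MB device buffer for activity records.
    size_t deviceBufSize = 32 * 1024 * 1024;
    size_t attrSize = sizeof(deviceBufSize);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                              &attrSize, &deviceBufSize);

    // 512K entries in the profiling semaphore pool.
    size_t semPoolSize = 512 * 1024;
    attrSize = sizeof(semPoolSize);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE,
                              &attrSize, &semPoolSize);
}
```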

It’d be a good data point if any of these settings avoids or delays the issue you observe.

How many CUDA contexts, CUDA streams, CPU threads, CUDA kernels, memcopies, and memsets are launched by the application? Which GPU and CUDA toolkit are you using? Would it be possible for you to provide a simple reproducer?