How to get the absolute time of (async) memcpys? Aka: a Parallel Nsight Analyzer-like graph

I’m running memcpys asynchronously with kernels. I’d like to see when the memcpys start and end relative to the kernels, more or less like the Parallel Nsight Analyzer does in Visual Studio, but I need it on non-Windows platforms (text output would be fine too; I can draw it later).

Example:

If the Parallel Nsight Analyzer doesn’t use some special/hidden hardware features, it must go through the runtime or driver API.

To obtain the exact time at which a kernel starts/stops I can use clock() inside the kernel, but what about memory transfers?

Please note that I’m not interested in the relative (elapsed) execution time of kernels and transfers. There is cudaEventElapsedTime() for that, but it only gives me how long each operation took, not how much they overlap. Moreover, the asynchronous operations belong to separate streams, so I can’t even take the relative time between them. Comparing the total execution time against a CPU timer would of course be very inaccurate.
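To make the point about durations versus overlap concrete, here is a minimal sketch in Python. The numbers are hypothetical and not measured on any GPU; it only shows that two operations with the same durations can overlap by different amounts depending on their absolute start times, which is why the lengths alone are not enough.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if they are disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


# Hypothetical timings in microseconds: a 4 us copy and a 6 us kernel.
# The durations are identical in both cases; only the kernel's start time differs.
concurrent = overlap(0.0, 4.0, 1.0, 7.0)    # kernel starts while the copy is running
serialized = overlap(0.0, 4.0, 4.0, 10.0)   # kernel starts when the copy ends

print(concurrent, serialized)
```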

Maybe there is a way to get the absolute time at which an event was recorded on the GPU? Or a driver API function that returns absolute timing information about async memory transfers?

I found out how to get absolute GPU timestamps: one needs to run the text profiler.

I’m still trying to work out what some of the undocumented fields mean (e.g. TIMESTAMPFACTOR in the .csv output).
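For the "text would be ok, I can draw it later" part, here is a rough Python sketch that turns a profiler-style .csv log into a text timeline. Everything about the log layout is an assumption on my side: the column names (gpustarttimestamp, method, gputime), the start timestamp being a hexadecimal nanosecond count, and gputime being in microseconds. The sample rows are made up, and header lines starting with "#" (where TIMESTAMPFACTOR appears) are simply skipped since their meaning is unclear. Adjust the names and units to whatever your log actually contains.

```python
import csv

# Made-up sample in an assumed profiler CSV layout (not real profiler output).
SAMPLE_LOG = """\
# header comment lines (e.g. TIMESTAMPFACTOR) are skipped
gpustarttimestamp,method,gputime
3e8,memcpyHtoDasync,4.0
7d0,my_kernel,6.0
1770,memcpyDtoHasync,3.0
"""


def parse_intervals(log_text):
    """Return (name, start_us, end_us) tuples, relative to the earliest start."""
    lines = [ln for ln in log_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    rows = []
    for row in csv.DictReader(lines):
        # Assumption: hex timestamp in nanoseconds, duration in microseconds.
        start_us = int(row["gpustarttimestamp"], 16) / 1000.0
        rows.append((row["method"], start_us, start_us + float(row["gputime"])))
    t0 = min(start for _, start, _ in rows)
    return [(name, start - t0, end - t0) for name, start, end in rows]


def render_timeline(intervals, us_per_char=1.0):
    """Draw one row per operation; each '#' covers us_per_char microseconds."""
    width = max(len(name) for name, _, _ in intervals)
    out = []
    for name, start, end in intervals:
        lead = int(round(start / us_per_char))
        length = max(1, int(round((end - start) / us_per_char)))
        out.append(f"{name:<{width}} |{' ' * lead}{'#' * length}")
    return "\n".join(out)


print(render_timeline(parse_intervals(SAMPLE_LOG)))
```

Rows whose bars share columns are operations that ran at the same time on the GPU, which is the overlap information that per-operation durations do not give.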