My issue may not be related, and I am not sure that asynchronous code or multiple streams are necessary to trigger it. What I found was that profile data could be missing from any of the streams.
The problem turned out to be caused mostly by how I was using the profiler, and the workaround was to change that. Our application takes a long time to reach kernel execution (up to minutes to load and get started). My habit was to launch the profiler, immediately pause the capture (to avoid minutes of empty profile data), let the rest of the program load, and then select start to resume capturing profile data.
It turns out that if I just let the profiler run (and do not “pause” it), most of the profile data is recorded and reported.
The actual behaviour I was seeing was as follows. I had a sequence of kernels (ignore the fact that they were asynchronous and used multiple streams): A, B, C, D, E, F
The launch behaviour for a given stream would be as follows:
A, B, C, D, E, F, A, B, C, D, E, F, A, B, C, D, E, F, A, B, C, D, E, F, A, B, C, D, E, F, A, B, C, D, E, F, ...
but the profile report would look like this (or be even sparser):
A, , , , E, , , , , , , F, , , , , , , A, B, C, , F, , , , D, , F, , , , , , F, ...
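In case anyone wants to check their own traces for the same symptom, here is a rough sketch in Python of how the gap could be quantified. It is not part of any profiler tooling: the function name `min_missing_launches` is made up for illustration, and it assumes you have already exported the reported kernel names for one stream as a plain list. It walks the known repeating launch order and counts how many launches must have been skipped to explain each reported kernel. The result is only a lower bound, since names alone cannot tell you whether whole A..F cycles were dropped.

```python
from itertools import cycle


def min_missing_launches(launch_order, reported):
    """Return the minimum number of launches missing from `reported`,
    assuming kernels were launched repeatedly in `launch_order`."""
    expected = cycle(launch_order)  # A, B, C, D, E, F, A, B, ...
    missing = 0
    for name in reported:
        if name not in launch_order:
            raise ValueError(f"unexpected kernel name: {name!r}")
        # Advance through the expected launches until we reach the one
        # the profiler reported; everything skipped on the way is missing.
        for candidate in expected:
            if candidate == name:
                break
            missing += 1
    return missing


# Kernel names that did show up in the sparse report above, in order.
reported = ["A", "E", "F", "A", "B", "C", "F", "D", "F", "F"]
print(min_missing_launches(list("ABCDEF"), reported))  # prints 14
```

A complete trace gives 0, so any non-zero result means the profiler dropped at least that many kernel records for the stream.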