I am using the Callback API to record CUDA API calls and later match them with the corresponding activity from the Activity API.
However, some calls, like cudaFree(0) or cudaMemset with size 0, do not generate any activity. Even more puzzling, some cudaLaunchKernel calls (with what seem to be valid arguments) also generate no activity.
Is there a way to know, at the callback EXIT point, whether or not a corresponding activity will happen in response to that call?
Assuming that the corresponding activity kinds you expect are enabled (CUPTI_ACTIVITY_KIND_MEMSET for memsets, CUPTI_ACTIVITY_KIND_MEMORY2 for memory allocations/frees, etc.), you will get records if and when such activity happens on the device. We do not generate activity records for cudaFree(0) or cudaMemset with size 0, as no free or memset actually happens under the hood (you will still receive callbacks for those APIs, as well as DRIVER/RUNTIME records if enabled).
To know whether a CUDA call generated device activity (for which CUPTI activity tracing is enabled), you can check whether any activity records carry the same correlation ID as the DRIVER or RUNTIME record for that API call. There is no way to tell upfront, from the callback alone, what device activity an API call will generate.
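As a sketch of this correlation-ID matching in post-processing (plain C++; the record structs here are hypothetical stand-ins for the CUPTI RUNTIME/DRIVER records and device-side activity records, which both carry a correlationId field):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for CUPTI records: each RUNTIME/DRIVER record and
// each device-side record for the same API call share one correlationId.
struct ApiRecord    { uint32_t correlationId; std::string apiName; };
struct DeviceRecord { uint32_t correlationId; std::string kind; };

// Return the API records that have no device-side record with the same
// correlation ID -- i.e. calls that produced no device activity.
std::vector<ApiRecord> unmatchedCalls(const std::vector<ApiRecord>& apis,
                                      const std::vector<DeviceRecord>& devs) {
    std::unordered_set<uint32_t> seen;
    for (const auto& d : devs) seen.insert(d.correlationId);
    std::vector<ApiRecord> out;
    for (const auto& a : apis)
        if (seen.find(a.correlationId) == seen.end()) out.push_back(a);
    return out;
}
```

This only works as a post-mortem over flushed buffers: the device-side record may arrive long after the API's exit callback has fired.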
Note that the cudaLaunchKernel() API is asynchronous: it does not wait until the kernel execution has completed. The kernel activity record is therefore generated later, when the kernel finishes.
You can see CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL/CUPTI_ACTIVITY_KIND_KERNEL records for a valid kernel launch by flushing the CUPTI activity buffers after the kernel has executed. If you do not see these records, please provide us with a small reproducer.
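To illustrate why a record for an already-completed kernel can still be invisible, here is a toy model of the buffering behavior in plain C++ (invented names, not the CUPTI API): completed records accumulate internally and only reach the client on a flush, which mirrors why cuptiActivityFlushAll() must run after the kernel has finished.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of CUPTI's activity buffering (names are invented): records for
// completed device work sit in an internal buffer and become visible to the
// client only when the buffer is flushed.
class ActivityBuffer {
public:
    // Device work completed; its record is buffered, not yet delivered.
    void deviceCompleted(const std::string& rec) { pending_.push_back(rec); }

    // Analogous to cuptiActivityFlushAll(): hand all buffered records
    // to the client via the buffer-completed path.
    void flushAll() {
        for (const auto& r : pending_) delivered_.push_back(r);
        pending_.clear();
    }

    // What the client has actually received so far.
    const std::vector<std::string>& records() const { return delivered_; }

private:
    std::vector<std::string> pending_;
    std::vector<std::string> delivered_;
};
```

So the order that matters is: synchronize (so the kernel has actually completed), then flush, then inspect the records.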
It may be tricky to provide a reproducible case, since this is happening inside code managed by ONNX Runtime on a private model.
For context, I was able to verify further that the missing CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL records are tied to cudaLaunchKernel calls issued by the ONNX Runtime; the kernels where this behavior is most commonly observed are:
cudnn reduced_divisor
gemmk1_kernel cublasGemvTensorStridedBatched
The observed arguments to the launch calls are all sensible (stream, grid, blocks, return value).
Please make sure that the API cuptiActivityFlushAll() is called.
Do you get some kernel activity records of kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL?
Do you get kernel activity records for some of the launches for the kernels “cudnn reduced_divisor” and “gemmk1_kernel cublasGemvTensorStridedBatched” but some are missing?
Does the application fail with any error?
Is there any difference in application functionality with and without CUPTI enabled?
To briefly answer your questions: yes, I do get matches for those kernels as well; there are no errors (and I check the return value of those calls in the exit callback); and there is no difference whether CUPTI is enabled or not.
I am attaching a trace “digest” that I generated. You’ll find the versions, a count summary of all runtime/driver calls, a summary of API calls with no matching activity (broken down by kernel name and by function and arguments), and then, for completeness, a similar breakdown for the calls that did have a matching activity.
It’s also worth noting that there were 8 calls to cudaMemsetAsync with “normal” arguments that also did not generate any activity. report.txt (17.2 KB)
I should also add that these results are reproducible and deterministic.
Thanks for sharing the report. It contains calls to cudaStreamBeginCapture() and cudaStreamEndCapture(). The cudaLaunchKernel() and cudaMemsetAsync() calls may be falling within a stream capture range. When a stream is in capture mode, operations pushed into the stream are not executed; they are instead captured into a graph.
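One way to check this hypothesis in your existing digest is to flag the runtime records that fall between the Begin/EndCapture pair. A sketch in plain C++ (hypothetical flattened call list; inside the application itself you could instead query cudaStreamIsCapturing() at launch time):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flattened runtime-record stream, ordered by timestamp.
struct Call { std::string api; };

// Mark which calls fall between cudaStreamBeginCapture and
// cudaStreamEndCapture: captured operations are recorded into a graph
// instead of executing, so they produce no kernel/memset activity records.
std::vector<bool> insideCapture(const std::vector<Call>& calls) {
    std::vector<bool> captured(calls.size(), false);
    bool capturing = false;
    for (size_t i = 0; i < calls.size(); ++i) {
        if (calls[i].api == "cudaStreamBeginCapture") { capturing = true;  continue; }
        if (calls[i].api == "cudaStreamEndCapture")   { capturing = false; continue; }
        captured[i] = capturing;
    }
    return captured;
}
```

This simplified sketch assumes a single stream; in a real trace you would key the capture state by stream handle, since capture status is per-stream. Note also that the captured work is not lost: it should execute later when the instantiated graph is launched, and kernel activity records would then be associated with that graph launch rather than with the original cudaLaunchKernel calls.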