I am using the Callback API to record CUDA API calls and later match them with the corresponding activity from the Activity API.
However, some calls, like cudaFree(0) or cudaMemset with size 0, do not generate any activity. Even more puzzling, some cudaLaunchKernel calls (with what seem to be valid arguments) also generate no activity.
Is there a way to know, at the callback EXIT point, whether or not a corresponding activity will happen in response to that call?
Assuming that the corresponding activity kinds you expect are enabled (CUPTI_ACTIVITY_KIND_MEMSET for memsets, CUPTI_ACTIVITY_KIND_MEMORY2 for memory allocations/frees, etc.), you will get records if and when such activity happens on the device. We do not generate activity records for cudaFree(0) or cudaMemset with size 0, as no free or memset actually happens under the hood (you will still receive callbacks for those APIs, as well as DRIVER/RUNTIME records if enabled).
To know whether a CUDA call generated device activity (for which CUPTI activity tracing is enabled), you can check whether any activity records carry the same correlation ID as the DRIVER or RUNTIME record for that API call. There is no way to tell upfront, from the callback alone, what device activity an API call will generate.
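As a sketch of this correlation-ID matching in post-processing (plain C++; the record structs here are hypothetical stand-ins for the CUPTI RUNTIME/DRIVER records and device-side activity records, which both carry a correlationId field):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for CUPTI records: each RUNTIME/DRIVER record and
// each device-side record for the same API call share one correlationId.
struct ApiRecord    { uint32_t correlationId; std::string apiName; };
struct DeviceRecord { uint32_t correlationId; std::string kind; };

// Return the API records that have no device-side record with the same
// correlation ID -- i.e. calls that produced no device activity.
std::vector<ApiRecord> unmatchedCalls(const std::vector<ApiRecord>& apis,
                                      const std::vector<DeviceRecord>& devs) {
    std::unordered_set<uint32_t> seen;
    for (const auto& d : devs) seen.insert(d.correlationId);
    std::vector<ApiRecord> out;
    for (const auto& a : apis)
        if (seen.find(a.correlationId) == seen.end()) out.push_back(a);
    return out;
}
```

This only works as a post-mortem over flushed buffers: the device-side record may arrive long after the API's exit callback has fired.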
Note that the cudaLaunchKernel() API is asynchronous: it does not wait until the kernel execution has completed. The kernel activity record is therefore generated later, when the kernel finishes.
You can see CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL/CUPTI_ACTIVITY_KIND_KERNEL records for a valid kernel launch by flushing the CUPTI activity buffers after the kernel has executed. If you do not see these records, please provide us with a small reproducer.
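To illustrate why a record for an already-completed kernel can still be invisible, here is a toy model of the buffering behavior in plain C++ (invented names, not the CUPTI API): completed records accumulate internally and only reach the client on a flush, which mirrors why cuptiActivityFlushAll() must run after the kernel has finished.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of CUPTI's activity buffering (names are invented): records for
// completed device work sit in an internal buffer and become visible to the
// client only when the buffer is flushed.
class ActivityBuffer {
public:
    // Device work completed; its record is buffered, not yet delivered.
    void deviceCompleted(const std::string& rec) { pending_.push_back(rec); }

    // Analogous to cuptiActivityFlushAll(): hand all buffered records
    // to the client via the buffer-completed path.
    void flushAll() {
        for (const auto& r : pending_) delivered_.push_back(r);
        pending_.clear();
    }

    // What the client has actually received so far.
    const std::vector<std::string>& records() const { return delivered_; }

private:
    std::vector<std::string> pending_;
    std::vector<std::string> delivered_;
};
```

So the order that matters is: synchronize (so the kernel has actually completed), then flush, then inspect the records.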
It may be tricky to provide a reproducible case, since this is happening inside code managed by ONNX Runtime on a private model.
For context, I was able to verify further that the missing CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL records are tied to cudaLaunchKernel calls issued by the ONNX Runtime; the kernels where this behavior is most commonly observed are:
cudnn reduced_divisor
gemmk1_kernel cublasGemvTensorStridedBatched
The observed arguments to the launch calls are all sensible (stream, grid, blocks, return value).
Please make sure that the API cuptiActivityFlushAll() is called.
Do you get some kernel activity records of kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL?
Do you get kernel activity records for some of the launches for the kernels “cudnn reduced_divisor” and “gemmk1_kernel cublasGemvTensorStridedBatched” but some are missing?
Does the application fail with any error?
Is there any difference in application functionality with and without CUPTI enabled?
To briefly answer your questions: yes, I do get matches for those kernels as well; there are no errors (and I check the return value of those calls in the exit callback); and there is no difference whether CUPTI is enabled or not.
I am attaching a trace “digest” that I generated. You’ll find the versions, a count summary of all runtime/driver calls, a summary of API calls with no matching activity (broken down by kernel name and by function and arguments), and then, for completeness, a similar breakdown for the calls that did have a matching activity.
It’s also worth noting that there were 8 calls to cudaMemsetAsync with “normal” arguments that also did not generate any activity. report.txt (17.2 KB)
I should also add that these results are reproducible and deterministic.
Thanks for sharing the report. It contains calls to cudaStreamBeginCapture() and cudaStreamEndCapture(). The cudaLaunchKernel() and cudaMemsetAsync() calls may be falling within a stream capture range. When a stream is in capture mode, operations pushed into the stream are not executed; they are instead captured into a graph.
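One way to check this hypothesis in your existing digest is to flag the runtime records that fall between the Begin/EndCapture pair. A sketch in plain C++ (hypothetical flattened call list; inside the application itself you could instead query cudaStreamIsCapturing() at launch time):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flattened runtime-record stream, ordered by timestamp.
struct Call { std::string api; };

// Mark which calls fall between cudaStreamBeginCapture and
// cudaStreamEndCapture: captured operations are recorded into a graph
// instead of executing, so they produce no kernel/memset activity records.
std::vector<bool> insideCapture(const std::vector<Call>& calls) {
    std::vector<bool> captured(calls.size(), false);
    bool capturing = false;
    for (size_t i = 0; i < calls.size(); ++i) {
        if (calls[i].api == "cudaStreamBeginCapture") { capturing = true;  continue; }
        if (calls[i].api == "cudaStreamEndCapture")   { capturing = false; continue; }
        captured[i] = capturing;
    }
    return captured;
}
```

This simplified sketch assumes a single stream; in a real trace you would key the capture state by stream handle, since capture status is per-stream. Note also that the captured work is not lost: it should execute later when the instantiated graph is launched, and kernel activity records would then be associated with that graph launch rather than with the original cudaLaunchKernel calls.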