Associating Kernel Names with Range ID

Hi everyone,

I’m working with the CUPTI (CUDA Profiling Tools Interface) and I have encountered a challenge regarding kernel profiling. In the Activity API, I can retrieve detailed information about profiled kernels, including their names and start/end timestamps. However, when I use the Profiling API, the kernels are identified by unique IDs instead of names.

My goal is to correlate the performance counters obtained from the Profiling API with the kernel names from the Activity API.

Is there a recommended way to combine these two pieces of information? Specifically, can I reliably map the unique kernel IDs from the Profiling API back to the names provided by the Activity API? Any insights or examples on how to achieve this would be greatly appreciated!

Thanks in advance for your help!

CUPTI profiling works at the context level, so for a multi-context application, enabling profiling for a single context will profile only the kernels launched in that context. For example:

kernelA<<<...>>>() <- ctx1

kernelB<<<...>>>() <- ctx1

kernelC<<<...>>>() <- ctx2

kernelD<<<...>>>() <- ctx1

When profiling is enabled for ctx1, we get profiling data for kernelA (range index 0), kernelB (range index 1), and kernelD (range index 2).

For correlating the range data to a kernel, you can use the approach shown in the callback_profiling sample (see the ProfilingCallbackHandler function) shipped in the CUPTI package: use the CUPTI Callback APIs to record the kernel launch sequence, the kernel name, and the context on which each kernel is launched. Maintain a table of that launch sequence, and you can then map each range index back to a kernel launch.

Thank you for your response. I am currently trying to combine the Activity API with the Profiling API to collect both activity data and low-level performance counters. I am following the pattern in the callback_profiling sample by enabling profiling for the contexts when they are created using the Callback API. Additionally, I enable some activity collection, such as concurrent kernel collection, by calling cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL).

The problem is that after I call cuptiProfilerEnableProfiling(&enableProfilingParams), the Activity API seems to stop working. I tried calling cuptiActivityFlushAll to retrieve the records in the buffer, but nothing happens (the registered buffer-completion callback is not invoked). However, if I don’t enable profiling with cuptiProfilerEnableProfiling(&enableProfilingParams), the activity buffer works as expected.

I am wondering if the Activity API and Profiling API are not designed to work together, or if there might be a bug in my code. Thanks in advance!

Hi Suraj,

I have a similar question about the latest Range Profiling APIs (see How to correlate range profiling metrics with a certain kernel?). Thank you for the suggestion. Using the Callback API for kernel sequencing is a good approach.

However, our tool uses ptrace for mid-process injection, which means we can’t ensure the callback and range profiling start simultaneously. This can cause the kernel sequence from the callback to be misaligned with the range indices from the profiler.

To solve this, my question is: Is there a common identifier, such as a correlationId, exposed in both the range profiling results and the kernel launch callback data? This would allow us to establish a definitive link between a range and a specific kernel, regardless of any timing discrepancies at startup.

Hi Frank, I found your post about associating a kernel with a range in auto range mode. I’m tackling a similar issue myself and was curious whether you ever found a solution (How to associate a range with a kernel?).

Apologies for digging up an old topic, but any information you might have would be a great help.

Thanks so much!

Hi Tianyao,

When injecting CUPTI into a running process, you can subscribe to the Driver API callbacks—particularly for kernel launch events. CUPTI provides notifications for both API_ENTER and API_EXIT phases of these calls.

To ensure proper profiling of the kernel, you can start range profiling in the API_ENTER callback and stop it in the API_EXIT callback. This approach also allows you to read and store the profiling data at the right time. Additionally, CUPTI provides the kernel name in the callbackData, which you can use to map the profiling results to specific kernel launches.

Refer to the pseudocode below; for a more detailed example, see the callback_profiling sample.

CUpti_SubscriberHandle subscriber;
cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)ProfilingCallbackHandler, NULL);
cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel);

// ...
// Inside the callback handler for the CUPTI_CB_DOMAIN_DRIVER_API domain
case CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel:
{
    if (pCallbackInfo->callbackSite == CUPTI_API_ENTER) {
        // Start range profiling
    } else {
        // Stop range profiling
        // Decode and evaluate the counter data image
    }
    break;
}

Please let me know if this helps.

Hi Suraj,

Thanks very much for your reply.

I tried to use range profiling (with callbacks) and simultaneously enabled the CUPTI Activity API to monitor CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL.

However, I found that once range profiling is active, the Activity API either captures no CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL events, or all the captured kernel names are WaitNs.

Is this a known limitation? Can range profiling and the Activity API for concurrent kernels not be used together?

Thanks for your help.

Best regards.

Range Profiling and Activity APIs can be used together, but with some caveats. For concurrent kernel records, you may receive incomplete data—typically with zero timestamps—particularly in Kernel Replay mode. Additionally, CUPTI does not flush incomplete records unless explicitly triggered via cuptiActivityFlushAll(1).

>> I found that once range profiling is active, the Activity API either captures no CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL events, or all the captured kernel names are WaitNs.

This is expected to some extent, but the WaitNs kernel names you are getting might be a bug in CUPTI. To investigate further, we may need a reproducible example or a detailed call sequence. Personally, I couldn’t see any records when combining the Activity API with Range Profiling.

If you only need kernel attributes like the kernel name and launch config, and are okay with zero timestamps, this setup might still be usable. However, if accurate timestamps are critical, this approach is not ideal for the following reasons:

  • While profiling, all kernel launches are serialized, defeating the purpose of CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL.
  • Profiling introduces significant overhead compared to tracing, which skews timestamp data and makes it unreliable.

Hi Suraj,

Thank you for your help. I’ve made some good progress based on your suggestions.

I was able to successfully collect both activity and range profiling data simultaneously. The method of using user ranges with user replay, where I push a range on a callback’s enter and pop on its exit, worked perfectly.

For this, I targeted the cudaLaunchKernel runtime API callback. However, I’ve run into a challenge with this approach. My ultimate goal is to get the NVLink metrics for NCCL-related kernels only. Unfortunately, it seems there is no direct correlation between the cudaLaunchKernel API calls and the specific NCCL kernels that are subsequently executed. As a result, the NVLink metrics I’m collecting aren’t correctly associated with the NCCL kernels I’m interested in.

So, my question is: since the callback approach on cudaLaunchKernel isn’t suitable for this specific task, what would be the best way to correctly associate the NVLink metrics generated by each NCCL kernel with that specific kernel instance?

Any insights you have on this would be greatly appreciated.

Best regards

Can you list which runtime or driver API callbacks you are targeting/subscribing to? I believe NCCL kernels can be tracked with your approach.

Can you please check whether subscribing to the driver API callbacks below helps in tracking NCCL kernels in your case:

  • CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel
  • CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel_ptsz
  • CUPTI_DRIVER_TRACE_CBID_cuLaunchKernelEx
  • CUPTI_DRIVER_TRACE_CBID_cuLaunchKernelEx_ptsz
  • CUPTI_DRIVER_TRACE_CBID_cuLaunchCooperativeKernel
  • CUPTI_DRIVER_TRACE_CBID_cuLaunchCooperativeKernel_ptsz
  • CUPTI_DRIVER_TRACE_CBID_cuLaunchCooperativeKernelMultiDevice

Hi Suraj,

Thanks very much! I’ve successfully used cuLaunchKernelEx to link NCCL kernels. However, I’ve noticed a timing difference in the profiler.

In auto range mode, standard CUDA kernels execute within their launch API’s lifecycle (Image 1). This allows range profiling and callbacks to capture their metrics perfectly.

In contrast, the NCCL kernel’s execution starts after its cuLaunchKernelEx call has already completed (Image 2).

My question is: When using range profiling, can a callback associated with cuLaunchKernelEx still accurately capture the NVLink metrics for the NCCL kernel, given this timing delay? I am concerned the callback will fire too early.

Thanks for your clarification.

Best regards,