Clarification on `cudaLaunchKernel` Count in `nsys stats -r cuda_api_gpu_sum`

I’m trying to understand the exact scope of cudaLaunchKernel statistics when using:

nsys stats -r cuda_api_gpu_sum path/to/*.sqlite

Specifically:
Does this metric include only user-level CUDA Runtime API calls (e.g., explicit <<<>>> or cudaLaunchKernel()), or does it also count:

  • Implicit Runtime-to-Driver API conversions (e.g., when Runtime calls cuLaunchKernel internally)
  • Direct Driver API calls (cuLaunchKernel) if used in the application?

I have this question because I compared the nsys result with eBPF traces, and I noticed discrepancies in cudaLaunchKernel counts.

Thank you!

@liuyis

“Implicit Runtime-to-Driver API conversions (e.g., when Runtime calls cuLaunchKernel internally)” are skipped.

“Direct Driver API calls (cuLaunchKernel) if used in the application” are included in the Nsys report.

Thanks for the quick reply. However, I tested with a simple program that uses the cuLaunchKernel driver API, and nsys reported it as cuLaunchKernel rather than cudaLaunchKernel. So every cudaLaunchKernel entry in the report should come from the Runtime API, right?

“I tested with a simple program that uses the cuLaunchKernel driver API, and nsys reported it as cuLaunchKernel rather than cudaLaunchKernel.”

That’s what I meant. If you call a driver API such as cuLaunchKernel directly, Nsys captures that driver API call and shows it in the report as cuLaunchKernel. If a driver API is invoked by a runtime API under the hood, Nsys skips the inner call. So when you call cudaLaunchKernel, the report contains only cudaLaunchKernel, not both cudaLaunchKernel and cuLaunchKernel.
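If you want to check this against your own capture, one option is to count the launch API names in the exported SQLite file directly and compare them with your eBPF numbers. The following is only a minimal sketch. It assumes the export contains a `CUPTI_ACTIVITY_KIND_RUNTIME` table whose `nameId` column refers to `StringIds(id, value)`, that both runtime and driver calls are recorded there, and that names may carry a version suffix such as `_v7000`. These schema details can differ between Nsight Systems versions, so please check your own file first. The helper names `count_launch_calls` and `build_demo_db` are made up for this illustration.

```python
import re
import sqlite3
from collections import Counter


def count_launch_calls(conn):
    """Count traced API calls whose name contains 'LaunchKernel'.

    Assumed schema (verify against your export):
      CUPTI_ACTIVITY_KIND_RUNTIME(nameId, ...) -> StringIds(id, value)
    """
    rows = conn.execute(
        """
        SELECT s.value, COUNT(*)
        FROM CUPTI_ACTIVITY_KIND_RUNTIME AS r
        JOIN StringIds AS s ON s.id = r.nameId
        WHERE s.value LIKE '%LaunchKernel%'
        GROUP BY s.value
        """
    ).fetchall()

    counts = Counter()
    for name, n in rows:
        # Drop an assumed trailing version suffix, e.g. "_v7000".
        base = re.sub(r"_v\d+$", "", name)
        counts[base] += n
    return dict(counts)


def build_demo_db():
    """Build a tiny in-memory database that mimics the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE CUPTI_ACTIVITY_KIND_RUNTIME (nameId INTEGER)")
    conn.executemany(
        "INSERT INTO StringIds (id, value) VALUES (?, ?)",
        [(1, "cudaLaunchKernel_v7000"), (2, "cuLaunchKernel"), (3, "cudaMemcpy_v3020")],
    )
    # Three runtime launches, two direct driver launches, one unrelated call.
    conn.executemany(
        "INSERT INTO CUPTI_ACTIVITY_KIND_RUNTIME (nameId) VALUES (?)",
        [(1,), (1,), (1,), (2,), (2,), (3,)],
    )
    return conn


if __name__ == "__main__":
    # Replace build_demo_db() with sqlite3.connect("your_report.sqlite")
    # to run the same query on a real export.
    print(count_launch_calls(build_demo_db()))
```

With a real export, the `cudaLaunchKernel` count should reflect only runtime launches made by the application, and the `cuLaunchKernel` count only direct driver launches. An eBPF probe attached to the driver entry point would see both kinds, which could explain a discrepancy like the one you observed.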

Thank you:)