Problem with profiling my program with dynamic parallelism on Ampere architecture


I’m trying to profile my program on a system with an Ampere GPU, but I can’t get any kernel timings out of the profiler. From what I’ve read, the problem seems to occur on architectures with compute capability (c.c.) >= 7.0.

Is there any way to get around this? I don’t care about the dynamic parallelism part; the total time of my kernel and the overview are what I’m after.
Or is this a known issue that is being fixed?


Which profiler are you using?

I’m using Nsight Systems 2022.1.1, but 2021.3.3 fails as well.

I compile the code using CUDA 9.2.148.

The recommended CUDA version for Ampere is 11.0 or newer for cc 8.0, and 11.1 or newer for cc 8.6.
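As a rough way to check for this kind of mismatch, the sketch below prints the compute capability of device 0 next to the CUDA runtime version the binary was built against. The helper `min_runtime_for_cc` is hypothetical (it is not part of the CUDA API) and only encodes the two cases mentioned above; the file name in the build command is also just an assumption.

```cuda
// Minimal sketch: compare the GPU's compute capability with the CUDA runtime
// version this binary was built against. Build with e.g.: nvcc check_cc.cu -o check_cc
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper, not a CUDA API. Returns the minimum runtime version
// (encoded as 1000*major + 10*minor, like cudaRuntimeGetVersion) for the two
// cases mentioned in this thread, or 0 if the case is not covered here.
int min_runtime_for_cc(int major, int minor) {
    if (major == 8 && minor == 0) return 11000;  // cc 8.0 -> CUDA 11.0
    if (major == 8 && minor == 6) return 11010;  // cc 8.6 -> CUDA 11.1
    return 0;                                    // not covered by this sketch
}

int main() {
    int runtime = 0;
    if (cudaRuntimeGetVersion(&runtime) != cudaSuccess) {
        std::printf("could not query the CUDA runtime version\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("could not query device 0\n");
        return 1;
    }

    std::printf("device 0: %s, cc %d.%d\n", prop.name, prop.major, prop.minor);
    std::printf("CUDA runtime: %d.%d\n", runtime / 1000, (runtime % 1000) / 10);

    int needed = min_runtime_for_cc(prop.major, prop.minor);
    if (needed != 0 && runtime < needed) {
        std::printf("runtime is older than recommended for this GPU (want >= %d.%d)\n",
                    needed / 1000, (needed % 1000) / 10);
    }
    return 0;
}
```

This only reports the mismatch; it does not say whether the profiler problem goes away with a newer toolkit.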

I’m not aware of any current limitations on CDP (CUDA Dynamic Parallelism) profiling in Nsight Systems, but it’s possible there are some. You might want to ask your question on the Nsight Systems forum.