I’m trying to profile my program on a system with an Ampere GPU, but I can’t get any kernel timings out of it. From what I’ve read, the problem seems to affect architectures with compute capability (c.c.) >= 7.0.
Is there any way to get around this? I don’t care about the dynamic bit; the total time of my kernel and the overview are what I’m after.
Or is this a known issue that is being fixed?