events/markers within kernel

I have a kernel that has multiple phases to it. I’d like to stick some sort of event or marker in between the phases to see how long each phase takes. I can do this in host code using NVTX and/or the runtime API cudaEvent* calls, but I don’t want to break up the kernel so that each phase becomes a separate kernel. The NVTX markers/ranges don’t work within device code. cudaEvent Create/Destroy/Record do seem to have device versions, but that’s all there is, which leaves me at a loss as to how to use them for what I’m trying to do. I’d be happy to pack up the cudaEvents and copy them to the host for analysis, but I can’t find a definition/handle for the actual underlying structure of events. (I haven’t actually tried it, but I don’t expect that copying the pointers from the device to the host is going to do me any good.)

I guess the broader question is what, if any, techniques are available to instrument code intra-kernel.

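The usual suggestion is to use clock64(), which can be called from device code and returns the value of a per-multiprocessor cycle counter. Read it at each phase boundary, store the values in a buffer in global memory, and copy that buffer back to the host for analysis.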
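For what it's worth, here is a minimal sketch of how clock64() timestamps might be collected per phase. The kernel name, the two placeholder phases, and the stamps buffer layout are all made up for illustration; the only thing taken from the thread is the use of clock64(). Error checking is omitted to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kPhases = 2;  // hypothetical: number of phases in the kernel

// Thread 0 of each block records clock64() at every phase boundary.
// stamps holds (kPhases + 1) entries per block: [start, end of 1, end of 2].
__global__ void phasedKernel(float *data, int n, long long *stamps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long *s = stamps + blockIdx.x * (kPhases + 1);

    if (threadIdx.x == 0) s[0] = clock64();

    // Phase 1 (placeholder work)
    if (i < n) data[i] = data[i] * 2.0f;
    __syncthreads();  // wait until every thread in the block is done
    if (threadIdx.x == 0) s[1] = clock64();

    // Phase 2 (placeholder work)
    if (i < n) data[i] = sqrtf(data[i]) + 1.0f;
    __syncthreads();
    if (threadIdx.x == 0) s[2] = clock64();
}

int main()
{
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data;
    long long *d_stamps;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_stamps, blocks * (kPhases + 1) * sizeof(long long));
    cudaMemset(d_data, 0, n * sizeof(float));

    phasedKernel<<<blocks, threads>>>(d_data, n, d_stamps);
    cudaDeviceSynchronize();

    // Copy back the timestamps of block 0 only and print cycles per phase.
    long long h[kPhases + 1];
    cudaMemcpy(h, d_stamps, sizeof(h), cudaMemcpyDeviceToHost);
    for (int p = 0; p < kPhases; ++p)
        printf("block 0, phase %d: %lld cycles\n", p + 1, h[p + 1] - h[p]);

    cudaFree(d_data);
    cudaFree(d_stamps);
    return 0;
}
```

A few caveats. The counter is per multiprocessor, so differences are only meaningful between timestamps taken by the same block; comparing values across blocks is not reliable. The results are in cycles, not seconds, and they include time during which the block's warps were not scheduled. The compiler may also move instructions across the clock64() calls, so treat the numbers as approximate and check them against a profiler if they matter.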