Is it possible to profile Cooperative Groups syncs?

I am using grid.sync() inside a kernel that is launched with cudaLaunchCooperativeKernel(). So far I haven’t been able to find any information on how to profile the synchronization overhead inside the kernel. Is there a way to do this out of the box, or do I need to instrument the kernel myself?
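To make the instrumentation option concrete, here is a minimal sketch of what I have in mind: read clock64() before and after grid.sync() and accumulate the difference per block. The names timed_sync_kernel, measure_sync_cycles and sync_cycles are made up for illustration and are not part of any library. I am also assuming that the measured cycles include the time spent waiting for the slowest block to arrive, so this would show barrier cost plus load imbalance rather than the pure overhead of the sync.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block accumulates the cycles it spends in grid.sync().
__global__ void timed_sync_kernel(long long *sync_cycles, int iterations)
{
    cg::grid_group grid = cg::this_grid();
    long long total = 0;
    for (int i = 0; i < iterations; ++i) {
        // ... the real per-iteration work would go here ...
        long long start = clock64();   // per-SM cycle counter
        grid.sync();                   // grid-wide barrier being measured
        total += clock64() - start;    // barrier cost plus wait for slower blocks
    }
    if (threadIdx.x == 0)
        sync_cycles[blockIdx.x] = total;   // one value per block
}

// Hypothetical host helper: returns per-block cycle totals, or an empty
// vector if cooperative launch is unsupported or any CUDA call fails.
std::vector<long long> measure_sync_cycles(int iterations, int threads_per_block = 128)
{
    int device = 0, supports_coop = 0, num_sms = 0, blocks_per_sm = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return {};
    cudaDeviceGetAttribute(&supports_coop, cudaDevAttrCooperativeLaunch, device);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
    if (!supports_coop) return {};

    // All blocks must be co-resident for grid.sync(), so size the grid from occupancy.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, timed_sync_kernel,
                                                  threads_per_block, 0);
    int num_blocks = num_sms * blocks_per_sm;
    if (num_blocks <= 0) return {};

    long long *d_cycles = nullptr;
    if (cudaMalloc(&d_cycles, num_blocks * sizeof(long long)) != cudaSuccess) return {};

    void *args[] = { &d_cycles, &iterations };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)timed_sync_kernel,
                                                  dim3(num_blocks), dim3(threads_per_block),
                                                  args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<long long> host(num_blocks);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), d_cycles, num_blocks * sizeof(long long),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_cycles);
    if (err != cudaSuccess) host.clear();
    return host;
}
```

Calling measure_sync_cycles(100) would give one cycle total per block; dividing by the iteration count gives a rough per-sync cost. Depending on the toolkit version, building this may require nvcc -rdc=true and an architecture flag for a device that supports cooperative launch.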
Thanks

@fschmidt Is this something Nsight Compute would help with?