How to compute the dynamic instruction frequency of a kernel?

Is there any way to compute the dynamic instruction frequency of a kernel execution? For example, a breakdown of the total number of executed instructions into categories: memory operations, SFU ops, control flow, fmadd, other ALU, etc.
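To make the kind of breakdown I mean concrete, here is a rough sketch in Python. It assumes a dynamic trace of executed instructions is already available as a list of PTX-style opcode strings without operands or predicates, which is exactly the part I do not know how to obtain on real hardware. The opcode-to-category table and the function names are my own illustrative assumptions, not an official classification.

```python
from collections import Counter

# Assumed, non-exhaustive mapping from PTX base opcode to category.
# Note: "mad" also covers integer multiply-add; this table is only illustrative.
CATEGORY_BY_OPCODE = {
    "ld": "memory", "st": "memory",
    "sin": "sfu", "cos": "sfu", "ex2": "sfu", "lg2": "sfu",
    "rsqrt": "sfu", "rcp": "sfu",
    "bra": "control", "call": "control", "ret": "control", "exit": "control",
    "fma": "fmadd", "mad": "fmadd",
}

def categorize(instruction):
    """Map an opcode string such as 'ld.global.f32' to a category name."""
    base_opcode = instruction.split(".", 1)[0]
    return CATEGORY_BY_OPCODE.get(base_opcode, "other_alu")

def instruction_mix(trace):
    """Return the fraction of executed instructions in each category."""
    counts = Counter(categorize(instruction) for instruction in trace)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

if __name__ == "__main__":
    # Hypothetical trace; a real one would come from a simulator or instrumentation.
    example_trace = ["ld.global.f32", "fma.rn.f32", "fma.rn.f32", "add.s32", "bra"]
    print(instruction_mix(example_trace))
```

The counting itself is trivial; the open question is how to get the per-instruction execution counts from a real GPU in the first place.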

Something similar is shown in the paper by Bakhoda et al., figures 4 and 5. However, they use a simulator, not a real GPU.

Ideally, I would also like to know how much time is spent executing instructions in each category, but that is probably even more difficult, especially in the SIMT model.

I am currently trying to get this information using the CUDA visual profiler, but it does not seem to report such detailed statistics.