Workload of the multiprocessors

Is there any tool that can show the workload of the individual GPU multiprocessors?

I know that I can see the occupancy in the CUDA Visual Profiler, but this value is computed on a per-warp basis, and that is not exactly what I would like to see.

It could happen, for example, that one thread in a block needs a lot more time than the others. In this case the multiprocessor would have a lot of idle time, and that is the information I would like to see.
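To make concrete what I mean by idle time, here is a rough sketch of the number I am after. It is only a simplified model, not a measurement of real hardware: it assumes a block keeps its execution resources until its slowest thread has finished, and the function name idle_fraction and the per-thread times are made up for illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Simplified model (an assumption, not how a profiler measures it):
// a block holds its resources until its slowest thread is done, so every
// faster thread "waits" for the difference.
// Returns the fraction of the available thread-time that is spent idle:
// 0.0 means perfectly balanced, values near 1.0 mean almost everything waits.
double idle_fraction(const std::vector<double>& thread_times) {
    if (thread_times.empty()) return 0.0;

    // Time at which the block as a whole is finished.
    const double longest =
        *std::max_element(thread_times.begin(), thread_times.end());
    if (longest <= 0.0) return 0.0;

    // Useful work actually done by all threads together.
    const double busy =
        std::accumulate(thread_times.begin(), thread_times.end(), 0.0);

    // Thread-time that was reserved while the block was resident.
    const double capacity = longest * static_cast<double>(thread_times.size());

    return 1.0 - busy / capacity;
}
```

For example, with 32 threads where 31 need 1 time unit and one needs 10, the model gives 1 - 41/320, i.e. roughly 87% idle time, even though the block counts as fully resident the whole time.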

Thanks for your help