I am using nvprof to profile my GPU workload, but I am wondering whether there is a way to monitor the GPU workload at run time in more detail.
For example, say I run two vision applications on a Tegra board and want a separate monitoring task to capture traces at run time. The captured data would show how many SMs, blocks, and threads each kernel is using. I am considering the CUPTI APIs, but it would be nice if there were a way to get this run-time profiling information without modifying the application code.
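What I have in mind is roughly the following sketch: a small shared library using the CUPTI Activity API, loaded through the `CUDA_INJECTION64_PATH` environment variable so the application binaries stay untouched. This is only an outline under my assumptions, not tested code; the exact activity-record struct version (here `CUpti_ActivityKernel4`) depends on the CUDA toolkit release, and the library/flag names are what I believe the CUPTI docs describe:

```c
/* cupti_trace.c -- sketch of an injection library that records per-kernel
 * launch geometry (grid/block dimensions) without modifying the app.
 * Assumes CUPTI headers/libs from the CUDA toolkit; untested sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

/* CUPTI asks us for a buffer to fill with activity records. */
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *size = BUF_SIZE;
    *maxNumRecords = 0;  /* no limit: fill the buffer */
}

/* CUPTI hands back a completed buffer; walk the kernel records. */
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record)
           == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            /* Struct version varies by toolkit; Kernel4 is an assumption. */
            CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4 *)record;
            /* Grid/block geometry per kernel; SM usage would have to be
             * derived separately (e.g. from occupancy). */
            printf("%s grid=(%d,%d,%d) block=(%d,%d,%d) dur=%llu ns\n",
                   k->name, k->gridX, k->gridY, k->gridZ,
                   k->blockX, k->blockY, k->blockZ,
                   (unsigned long long)(k->end - k->start));
        }
    }
    free(buffer);
}

/* Entry point the CUDA driver calls when CUDA_INJECTION64_PATH
 * points at this shared library. */
int InitializeInjection(void)
{
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    return 1;
}
```

The idea would be to build this as `libtrace.so` and launch each vision application with `CUDA_INJECTION64_PATH=./libtrace.so ./app`, so a separate consumer of the printed trace acts as the monitoring task. Is this the right direction, or is there an existing tool that does this?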