Is there any way to profile the instruction count of a CUDA program?

I want to profile the instruction count of a CUDA program at regular intervals.

I found that I can use Nsight, but I need to build my own profiling tool so that I can merge in some other data.

I think CUPTI can do this, but I don’t know whether CUPTI can profile another process (the profiler and the CUDA kernel run in different programs).

So, is there any way to profile the instruction count of a CUDA kernel dynamically?