I am trying to get event metrics per thread or per warp via CUPTI. For example, suppose I launch 256 threads in my kernel; I want to get 256 sets of event metrics, one per thread. How can I obtain the events of each thread via the CUPTI library? I know the event_sample example in CUPTI shows how to use a pthread to sample event metrics. However, I want the event metrics of each individual thread. Is this possible?
CUPTI does not support collection of events at the per thread block, warp, or thread level. It is possible to collect this information for deterministic events such as inst_executed, branches, divergent_branches, etc. However, it is not possible to collect non-deterministic events such as stall reasons, cache misses, etc. at this level of granularity.
I recommend that you enter a feature request through the registered developer program. Please include what types of events you are interested in collecting and how you would specify the target threadIds and warpIds.
Per-thread metrics will differ from warp-level metrics only if your code is highly divergent.
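To make the divergence point concrete, here is a small host-side C++ sketch. It is a toy model, not CUPTI code and not how the hardware counters are implemented: the names WarpCounts, issue, and simulate_branch are made up for illustration, and I am assuming a warp-level counter that ticks once per instruction issued for the warp, while a per-thread counter ticks only for lanes active on that instruction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

// Hypothetical counters for one warp (illustrative names, not a CUPTI type).
struct WarpCounts {
    std::uint64_t warp_inst_executed = 0;                // once per instruction issued for the warp
    std::array<std::uint64_t, kWarpSize> thread_inst{};  // once per instruction per active lane
};

// Issue n instructions with the given active-lane mask.
void issue(WarpCounts& c, std::uint32_t active_mask, int n) {
    assert(n >= 0);
    if (active_mask == 0) return;  // no lane takes this path, so nothing is issued
    c.warp_inst_executed += static_cast<std::uint64_t>(n);
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (active_mask & (1u << lane)) {
            c.thread_inst[lane] += static_cast<std::uint64_t>(n);
        }
    }
}

// Model: if (taken) { then_len instructions } else { else_len instructions }
// Bit i of taken_mask says whether lane i takes the "then" side.
WarpCounts simulate_branch(std::uint32_t taken_mask, int then_len, int else_len) {
    WarpCounts c;
    issue(c, taken_mask, then_len);   // lanes on the "then" side
    issue(c, ~taken_mask, else_len);  // remaining lanes on the "else" side
    return c;
}
```

With simulate_branch(0xFFFFFFFFu, 10, 20) every lane takes the same side, so the warp counter and all 32 per-thread counters read 10. With simulate_branch(0x0000FFFFu, 10, 20) the warp is split: the warp counter reads 30, lanes 0 to 15 read 10, and lanes 16 to 31 read 20. In this model the per-thread numbers only carry extra information in the divergent case.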