How to use NVPROF on code compiled with NVRTC?

Hi,
I have been using NVPROF to collect all 113 performance counters from my kernels
that run on a TitanV. I was never able to get CUPTI to give me all the counters the way NVPROF does.

Now I am using NVRTC (with JITIFY) to compile my custom kernels on the fly. How can I get NVPROF
to give me the same 113 performance counters for this NVRTC/JITIFY case?
–Bob

Sorry you are having trouble with this.
nvprof works well with NVRTC kernels here with CUDA 10.0.
What toolkit are you using?
Is it possible for you to provide a minimal reproducer?

Also, it is possible for CUPTI to collect all counters. Unfortunately, there is no example of this in the sample code provided with the toolkit. We’ll get you some example code soon.

I was using CUDA 9.x under Win7/64.
I will switch to CUDA 10 today.
It would be magnificent if CUPTI could retrieve all of the counters that NVPROF returns.

There is an NVIDIA researcher whose CUPTI code I was hacking on. You can find it here.

It works with 2 metrics with CUDA 10 and dd 416.34 under Win7/64.

"inst_per_warp",
"branch_efficiency",

However, with any of the following metrics it fails:

"warp_execution_efficiency",
"warp_nonpred_execution_efficiency",
"inst_replay_overhead",

The error is, for example: Metric value retrieval failed for metric warp_execution_efficiency.

I was using the CUPTI callback_metric example to guide me. If you use any of the above metrics with it,
the sample app works.

I eventually discovered that the new sample code refers to CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000,
which didn't exist 4-5 years ago when the researcher created his tool.

It would be great to get an example that is up to date for all the metrics.

rbischof – were you able to hack out the example code for all the counters?

There is no single API call that lists the events CUPTI supports. The following steps can be used instead:

  1. CUptiResult cuptiDeviceGetNumEventDomains ( CUdevice device, uint32_t* numDomains ): Get the number of domains for a device.
  2. CUptiResult cuptiDeviceEnumEventDomains ( CUdevice device, size_t* arraySizeBytes, CUpti_EventDomainID* domainArray ): Get the event domains for a device.
  3. CUptiResult cuptiEventDomainGetNumEvents ( CUpti_EventDomainID eventDomain, uint32_t* numEvents ): Get number of events in a domain.
  4. CUptiResult cuptiEventDomainEnumEvents ( CUpti_EventDomainID eventDomain, size_t* arraySizeBytes, CUpti_EventID* eventArray ): Get the events in a domain.

Refer to the CUPTI documentation: https://docs.nvidia.com/cuda/cupti/group__CUPTI__EVENT__API.html
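
For readers following along, the four steps above can be sketched roughly as below. This is only an illustration, not code from the toolkit samples: the helper names enumerate_ids and print_all_events are made up for this post, the __has_include guard is there only so the file still compiles on a machine without the CUDA toolkit, and the CUPTI part assumes cuInit(0) has already been called and that device came from cuDeviceGet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: generic form of the "get the count, then fill an array"
// pattern in steps 1-4. `count` writes the number of IDs; `fill` takes the
// buffer size in bytes (in/out) and the buffer. Both return 0 on success.
template <class Id, class CountFn, class FillFn>
std::vector<Id> enumerate_ids(CountFn count, FillFn fill) {
    std::uint32_t n = 0;
    if (count(&n) != 0 || n == 0) return {};
    std::vector<Id> ids(n);
    std::size_t bytes = ids.size() * sizeof(Id);
    if (fill(&bytes, ids.data()) != 0) return {};
    // On return, `bytes` holds the number of bytes actually written.
    assert(bytes <= ids.size() * sizeof(Id));
    ids.resize(bytes / sizeof(Id));
    return ids;
}

#if __has_include(<cupti.h>)
#include <cuda.h>
#include <cupti.h>

// Hypothetical helper: prints the name of every event CUPTI reports for `device`.
inline void print_all_events(CUdevice device) {
    // Steps 1 and 2: event domains of the device.
    auto domains = enumerate_ids<CUpti_EventDomainID>(
        [&](std::uint32_t* n) { return cuptiDeviceGetNumEventDomains(device, n); },
        [&](std::size_t* bytes, CUpti_EventDomainID* out) {
            return cuptiDeviceEnumEventDomains(device, bytes, out);
        });
    for (CUpti_EventDomainID domain : domains) {
        // Steps 3 and 4: events inside each domain.
        auto events = enumerate_ids<CUpti_EventID>(
            [&](std::uint32_t* n) { return cuptiEventDomainGetNumEvents(domain, n); },
            [&](std::size_t* bytes, CUpti_EventID* out) {
                return cuptiEventDomainEnumEvents(domain, bytes, out);
            });
        for (CUpti_EventID event : events) {
            char name[128];
            std::size_t len = sizeof(name);
            if (cuptiEventGetAttribute(event, CUPTI_EVENT_ATTR_NAME, &len, name) ==
                CUPTI_SUCCESS) {
                std::printf("%s\n", name);
            }
        }
    }
}
#endif
```

Note that this only lists event names. Collecting values for several events or metrics at once still needs event groups and a launch callback, which this sketch does not attempt.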

Let us know if you need any additional information.

ssatoor,
Thanks, but I was well aware of the API you described. My responses showed how I was
hacking away at the example code. The callback must be significantly more complex when
there is more than 1 event or metric being monitored. On 10/26, rbischof mentioned that he would
send an example soon. I guess he gave up?
–bz

Sorry for the delay (which continues, unfortunately). We haven’t forgotten about you and are working to get this sample code to you.

Cool. Thanks for the feedback.

We have updated the CUPTI sample code on GitHub: https://github.com/srvm/cupti_profiler.
It includes fixes for the issues reported on this post.
It also includes query and collection of all the supported metrics.

To profile specific metrics instead, comment out the following line in examples/demo.cu:
#define PROFILE_ALL_EVENTS_METRICS 1
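
As a rough illustration of how a compile-time switch like this usually works (this is not the actual contents of demo.cu; the function metrics_to_profile and the convention that an empty list means "collect everything" are assumptions made up for this sketch):

```cpp
#include <string>
#include <vector>

// Commented out, as described above, so that only specific metrics are profiled.
// #define PROFILE_ALL_EVENTS_METRICS 1

// Hypothetical helper: returns the metric names to request from the profiler.
// In this sketch an empty list stands for "query the device and collect all".
std::vector<std::string> metrics_to_profile() {
#ifdef PROFILE_ALL_EVENTS_METRICS
    return {};
#else
    return {"inst_per_warp", "branch_efficiency"};
#endif
}
```

With the #define left commented out, the function returns the two explicitly named metrics; restoring the #define switches it back to the collect-everything path.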

Thanks for your feedback.