Reading all events through CUPTI

I am trying to list all the events provided by the Tesla K20 that I have. CUPTI provides several functions to enumerate those events. The events are divided into different domains, and the sample code cuda-5.0/extras/CUPTI/sample/cupti_query lists the events for the first domain ID only. I came across the function

CUptiResult cuptiDeviceEnumEventDomains ( CUdevice device,
size_t* arraySizeBytes,
CUpti_EventDomainID* domainArray )

and in the cupti_query code, the domainArray is just a single domainID element rather than an array.

How can I access different domains using these functions to get the details of all the events present on my K20?

This code was written in a browser. Is your code doing something of this form?

CUresult curesult = CUDA_SUCCESS;
CUptiResult cuptiresult = CUPTI_SUCCESS;

curesult = cuInit(0);
CHECK_CUDA(curesult);

CUdevice device;
curesult = cuDeviceGet(&device, 0);
CHECK_CUDA(curesult);

uint32_t numDomains = 0;
cuptiresult = cuptiDeviceGetNumEventDomains(device, &numDomains);
CHECK_CUPTI(cuptiresult);

size_t allocBytes = sizeof(CUpti_EventDomainID) * numDomains;
CUpti_EventDomainID* pDomainIDs = (CUpti_EventDomainID*)malloc(allocBytes);
CHECK(pDomainIDs);

size_t domainBytes = allocBytes;
cuptiresult = cuptiDeviceEnumEventDomains(device, &domainBytes, pDomainIDs);
CHECK_CUPTI(cuptiresult);

CHECK(domainBytes == allocBytes);

// do something

free(pDomainIDs);

Yes, the code is doing something of this sort. I was actually trying to get the various events that “nvprof --query-events” lists. What I am trying to do is read GPU events like “inst_executed” periodically while running benchmarks such as Rodinia, in a manner transparent to the application. I am noticing that when I use “/usr/local/cuda-5.0/extras/CUPTI/sample/event_sampling” to measure instructions executed, it shows absolutely no change in the number of instructions executed while a Rodinia CUDA benchmark runs alongside it. Does that mean GPU events cannot be read transparently to the application code?

gokussy9,

On Fermi and Kepler, the SM PM counters that CUPTI can access are context switched. These counter values can be read only if you pass in the CUcontext handle. The PM counters for TEX, L2, and FB are not context switched, so it is possible to sample these values from any context as long as the context is on the same CUDA device.

If you have the source code for the benchmark, then you can create a background thread and subscribe to CUPTI callbacks to identify and track work submitted on CUDA contexts. The sampling code would look similar to the event_sampling CUPTI sample (which uses CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS).

If you do not have the source code for the benchmark then you will have to inject a background thread into the process. This is operating system specific and it is not necessarily an easy task. On Windows you can use Microsoft Detours and on Linux you can use LD_PRELOAD.
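For the Linux case, the injection can be sketched roughly as below. This is only a minimal sketch under stated assumptions, not a tested profiler: the names sampler_start, sampler_stop, sampler_main and sample_once are hypothetical, and sample_once is a placeholder that only bumps a counter where real code would read CUPTI event values. It relies on the GCC/Clang constructor and destructor attributes, which run a function when the library is loaded and unloaded.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

static pthread_t sampler_thread;
static atomic_int sampler_running = 0;   /* 1 while the thread should loop */
static atomic_ulong sampler_ticks = 0;   /* number of samples taken so far */

/* Hypothetical placeholder: real code would read CUPTI event values here. */
static void sample_once(void)
{
    atomic_fetch_add(&sampler_ticks, 1);
}

/* Background thread body: sample roughly once per millisecond. */
static void *sampler_main(void *arg)
{
    (void)arg;
    struct timespec period = { 0, 1000 * 1000 };
    while (atomic_load(&sampler_running)) {
        sample_once();
        nanosleep(&period, NULL);
    }
    return NULL;
}

/* Runs when the library is loaded, before the application's main(). */
__attribute__((constructor))
static void sampler_start(void)
{
    atomic_store(&sampler_running, 1);
    if (pthread_create(&sampler_thread, NULL, sampler_main, NULL) != 0) {
        atomic_store(&sampler_running, 0);
        fprintf(stderr, "sampler: failed to start thread\n");
    }
}

/* Runs at normal process exit: stop the thread and report. */
__attribute__((destructor))
static void sampler_stop(void)
{
    if (atomic_exchange(&sampler_running, 0)) {
        pthread_join(sampler_thread, NULL);
        fprintf(stderr, "sampler: %lu samples taken\n",
                (unsigned long)atomic_load(&sampler_ticks));
    }
}
```

Assuming the file is saved as sampler.c, it could be built with "gcc -shared -fPIC -o libsampler.so sampler.c -lpthread" and loaded with "LD_PRELOAD=./libsampler.so ./benchmark". Because the SM counters are context switched, the injected code would still need to get hold of the application's CUcontext before it could read them.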

So this means I have to create the background thread in the source code itself to be able to read the counter events? And I cannot read the counters transparently to the source code, as I can with an Intel CPU, unless I use the LD_PRELOAD technique?

Waiting for your reply.

Correct. CUPTI does not support reading all counters at a device level from a different process.

It is possible to sample TEX, L2, and FB counters from a different process. The counter values will be global device values.

It is not possible to sample SM counters from a different process. These counters currently can only be read by passing the CUcontext handle in the same process to the CUPTI API.

Thanks for your reply. Can you please point me to some documentation that describes which counters can be read from a different process as global device values?

Can the instructions executed counter be read transparently from a different process during execution of a CUDA benchmark?

The CUPTI documentation does not provide any details on this behavior, and the behavior may change in a future version. inst_executed is an SM counter and cannot be accessed from a different process with the CUDA 5.* versions of CUPTI.

I am also looking to monitor performance counters transparently from another process. I don’t have access to the source code and am using a GTX 960.

I am able to access some of the performance counters through PAPI, but many just return 0.

Has anything changed with this, or are there any other ideas for approaching the issue? Thanks.