Hi,
I’m trying to profile some CUDA code running on the device using the CUDA driver API’s event functions. I have multiple threads issuing CUDA API calls to the same GPU through the same context. Each thread has its own stream and all driver calls are asynchronous.
I have the equivalent of the following code:
// [LIBRARY CODE]
// API wrappers that apply/unapply the required context
// These functions can get called from any thread
void EventCreate(CUevent* event)
{
    cuCtxSetCurrent(global_context);
    cuEventCreate(event, CU_EVENT_DEFAULT);
    cuCtxSetCurrent(0);
}
void EventRecord(CUevent event, CUstream stream)
{
    cuCtxSetCurrent(global_context);
    cuEventRecord(event, stream);
    cuCtxSetCurrent(0);
}
bool EventElapsedTime(float* ms, CUevent start, CUevent end)
{
    cuCtxSetCurrent(global_context);
    bool success = cuEventElapsedTime(ms, start, end) == CUDA_SUCCESS;
    cuCtxSetCurrent(0);
    return success;
}
// [THREAD BEING PROFILED]
// Create events on thread startup, with timing enabled and no blocking sync
CUevent event_epoch, event_start, event_end;
EventCreate(&event_epoch);
EventCreate(&event_start);
EventCreate(&event_end);
CUstream stream = /* stream for current thread */;
// Call once on thread start as a reference point
EventRecord(event_epoch, stream);
// For each bit of work that needs profiling
EventRecord(event_start, stream);
// ...do CUDA work...
EventRecord(event_end, stream);
// [PROFILER THREAD]
// For each bit of work that has been profiled
float ms_start, ms_end;
bool success = EventElapsedTime(&ms_start, event_epoch, event_start);
if (success)
    success = EventElapsedTime(&ms_end, event_epoch, event_end);
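// The duration of the profiled work is then just the difference between the two offsets
float ms_elapsed = success ? (ms_end - ms_start) : 0.0f;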
This seems to work fine; it’s reporting a lot of useful numbers to me, at least.
However, when I try to “profile the profiler” and inspect the CPU cost of the calls to cuEventElapsedTime, I periodically get large measurements, on the order of milliseconds.
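For reference, the host-side timing is nothing elaborate; a simplified sketch of the kind of measurement I mean (the std::chrono usage here is illustrative, not my exact harness):
#include <chrono>
// Wall-clock timing wrapped around the elapsed-time query (simplified)
auto cpu_begin = std::chrono::steady_clock::now();
bool success = EventElapsedTime(&ms_start, event_epoch, event_start);
auto cpu_end = std::chrono::steady_clock::now();
double call_us = std::chrono::duration<double, std::micro>(cpu_end - cpu_begin).count();
// call_us is normally small, but every so often it jumps into the millisecond range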
Is this expected? Is my use of contexts causing the performance problem? Anything else I could have missed?
Cheers,
- Don