cuEventElapsedTime slow/blocking?

Hi,

I’m trying to profile some CUDA code running on the device using the driver event API. I have multiple threads issuing CUDA API calls to the same GPU using the same context. Each thread has its own stream and all driver calls are asynchronous.

I have the equivalent of the following code:

// [LIBRARY CODE]
// API wrappers that apply/unapply the required context
// These functions can get called from any thread
void EventCreate(CUevent* event)
{
    cuCtxSetCurrent(global_context);
    cuEventCreate(event, CU_EVENT_DEFAULT);
    cuCtxSetCurrent(0);
}
void EventRecord(CUevent event, CUstream stream)
{
    cuCtxSetCurrent(global_context);
    cuEventRecord(event, stream);
    cuCtxSetCurrent(0);
}
bool EventElapsedTime(float* ms, CUevent start, CUevent end)
{
    cuCtxSetCurrent(global_context);
    bool success = cuEventElapsedTime(ms, start, end) == CUDA_SUCCESS;
    cuCtxSetCurrent(0);
    return success;
}


// [THREAD BEING PROFILED]

// Create events on thread startup with default flags (no blocking sync, timing enabled)
CUevent event_epoch, event_start, event_end;
EventCreate(&event_epoch);
EventCreate(&event_start);
EventCreate(&event_end);

CUstream stream = /* stream for current thread */;

// Call once on thread start as a reference point
EventRecord(event_epoch, stream);

// For each bit of work that needs profiling
EventRecord(event_start, stream);
// ...do CUDA work...
EventRecord(event_end, stream);


// [PROFILER THREAD]
// For each bit of work that has been profiled
float ms_start, ms_end;
bool success = EventElapsedTime(&ms_start, event_epoch, event_start);
if (success)
	success = EventElapsedTime(&ms_end, event_epoch, event_end);

This seems to work fine; it’s reporting a lot of useful numbers to me, at least.

However, when I try to “profile the profiler” and inspect the CPU cost of calls to cuEventElapsedTime, I periodically get large measurements, on the order of milliseconds.

Is this expected? Is my use of contexts causing the performance problem? Anything else I could have missed?
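
For reference, this is roughly how I’m timing the calls on the CPU side (a simplified sketch; the clock choice and variable names are illustrative, not my actual code):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
bool success = EventElapsedTime(&ms_start, event_epoch, event_start);
auto t1 = std::chrono::steady_clock::now();
double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
// us is usually small, but periodically spikes into the thousands (i.e. milliseconds)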

Cheers,

  • Don

OK, that didn’t work. I removed all my context management and simplified the code to:

// [THREAD BEING PROFILED]

// Create events on thread startup with default flags (no blocking sync, timing enabled)
CUevent event_epoch, event_start, event_end;
cuEventCreate(&event_epoch, CU_EVENT_DEFAULT);
cuEventCreate(&event_start, CU_EVENT_DEFAULT);
cuEventCreate(&event_end, CU_EVENT_DEFAULT);

CUstream stream = /* stream for current thread */;

// Call once on thread start as a reference point
cuEventRecord(event_epoch, stream);

// For each bit of work that needs profiling
cuEventRecord(event_start, stream);
// ...do CUDA work...
cuEventRecord(event_end, stream);

// [PROFILER THREAD]
// For each bit of work that has been profiled
float ms_start, ms_end;
bool success = cuEventElapsedTime(&ms_start, event_epoch, event_start) == CUDA_SUCCESS;
if (success)
	success = cuEventElapsedTime(&ms_end, event_epoch, event_end) == CUDA_SUCCESS;

Note that my original code used cuEventQuery to see if the result was ready before calling cuEventElapsedTime. Same performance issue.
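
The guard looked roughly like this (a sketch from memory, not the exact code):

// Only read the timing once both events have actually completed
if (cuEventQuery(event_start) == CUDA_SUCCESS &&
    cuEventQuery(event_end) == CUDA_SUCCESS)
{
    float ms;
    cuEventElapsedTime(&ms, event_start, event_end);
    // ...still periodically slow, even though both events report complete
}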

I’m seeing about 20 calls to cuEventElapsedTime take 15 ms in total! That seems crazy, as if there’s a blocking call in there.

Cheers,

  • Don

Doing a CUDA API trace in Nsight and sorting by duration, I can see that cuEventElapsedTime is right at the top, packed next to cuStreamSynchronize and costing a similar amount of time.

It does look like cuEventElapsedTime is doing a complete sync here - can I stop this?