cuEventElapsedTime slow/blocking?

Hi,

I’m trying to profile some CUDA code running on the device using the driver event API. I have multiple threads issuing CUDA API calls to the same GPU using the same context. Each thread has its own stream and all driver calls are asynchronous.

I have the equivalent of the following code:

// [LIBRARY CODE]
// API wrappers that apply/unapply the required context
// These functions can get called from any thread
void EventCreate(CUevent* event)
{
    cuCtxSetCurrent(global_context);
    cuEventCreate(event, CU_EVENT_DEFAULT);
    cuCtxSetCurrent(0);
}
void EventRecord(CUevent event, CUstream stream)
{
    cuCtxSetCurrent(global_context);
    cuEventRecord(event, stream);
    cuCtxSetCurrent(0);
}
bool EventElapsedTime(float* ms, CUevent start, CUevent end)
{
    cuCtxSetCurrent(global_context);
    bool success = cuEventElapsedTime(ms, start, end) == CUDA_SUCCESS;
    cuCtxSetCurrent(0);
    return success;
}


// [THREAD BEING PROFILED]

// Create events on thread startup with default flags (no blocking sync, timing enabled)
CUevent event_epoch, event_start, event_end;
EventCreate(&event_epoch);
EventCreate(&event_start);
EventCreate(&event_end);

CUstream stream = /* stream for current thread */;

// Call once on thread start as a reference point
EventRecord(event_epoch, stream);

// For each bit of work that needs profiling
EventRecord(event_start, stream);
// ...do CUDA work...
EventRecord(event_end, stream);


// [PROFILER THREAD]
// For each bit of work that has been profiled
float ms_start, ms_end;
bool success = EventElapsedTime(&ms_start, event_epoch, event_start);
if (success)
	success = EventElapsedTime(&ms_end, event_epoch, event_end);

This seems to work fine; it’s reporting a lot of useful numbers to me, at least.

However, when I try to “profile the profiler” and inspect the CPU cost of calls to cuEventElapsedTime, I periodically get large measurements, on the order of milliseconds.

Is this expected? Is my use of contexts causing the performance problem? Anything else I could have missed?
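
For reference, this is roughly how I’m timing the calls on the CPU side (a simplified sketch; the clock choice and variable names are illustrative, not my actual code):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
bool success = EventElapsedTime(&ms_start, event_epoch, event_start);
auto t1 = std::chrono::steady_clock::now();
double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
// us is usually small, but periodically spikes into the thousands (i.e. milliseconds)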

Cheers,

  • Don

OK, that didn’t work. I removed all my context management and simplified the code to:

// [THREAD BEING PROFILED]

// Create events on thread startup with default flags (no blocking sync, timing enabled)
CUevent event_epoch, event_start, event_end;
cuEventCreate(&event_epoch, CU_EVENT_DEFAULT);
cuEventCreate(&event_start, CU_EVENT_DEFAULT);
cuEventCreate(&event_end, CU_EVENT_DEFAULT);

CUstream stream = /* stream for current thread */;

// Call once on thread start as a reference point
cuEventRecord(event_epoch, stream);

// For each bit of work that needs profiling
cuEventRecord(event_start, stream);
// ...do CUDA work...
cuEventRecord(event_end, stream);

// [PROFILER THREAD]
// For each bit of work that has been profiled
float ms_start, ms_end;
bool success = cuEventElapsedTime(&ms_start, event_epoch, event_start) == CUDA_SUCCESS;
if (success)
	success = cuEventElapsedTime(&ms_end, event_epoch, event_end) == CUDA_SUCCESS;

Note that my original code used cuEventQuery to see if the result was ready before calling cuEventElapsedTime. Same performance issue.
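
The guard looked roughly like this (a sketch from memory, not the exact code):

// Only read the timing once both events have actually completed
if (cuEventQuery(event_start) == CUDA_SUCCESS &&
    cuEventQuery(event_end) == CUDA_SUCCESS)
{
    float ms;
    cuEventElapsedTime(&ms, event_start, event_end);
    // ...still periodically slow, even though both events report complete
}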

I’m seeing about 20 calls to cuEventElapsedTime take 15 ms in total! That seems crazy, as if there’s a blocking call in there.

Cheers,

  • Don

Doing a CUDA API trace in Nsight and sorting by duration, I can see that cuEventElapsedTime is right at the top, packed next to cuStreamSynchronize and costing a similar amount of time.

It does look like cuEventElapsedTime is doing a complete sync here - can I stop this?