Hi everybody,
As I am doing some microbenchmarking of asynchronous mechanisms implemented with CUDA, I've noticed a very strange behaviour: when cudaEventCreate is called many times, its cost becomes extremely high. When executing the following piece of code with different numbers of iterations, the average duration of cudaEventCreate climbs to several hundred microseconds, which is just huge.
[codebox]
cudaEvent_t *events = calloc(niter, sizeof(cudaEvent_t));

gettimeofday(&start_create, NULL);
unsigned iter;
for (iter = 0; iter < niter; iter++)
{
        status = cudaEventCreate(&events[iter]);
        assert(!status);
}
gettimeofday(&end_create, NULL);
[/codebox]
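For reference, here is the small helper I use to turn the two gettimeofday() samples into a per-call average (the names elapsed_us and niter are just mine, not from any API):

[codebox]
#include <sys/time.h>

/* Elapsed wall-clock time in microseconds between two gettimeofday() samples. */
static long elapsed_us(struct timeval start, struct timeval end)
{
        return (end.tv_sec - start.tv_sec) * 1000000L
             + (end.tv_usec - start.tv_usec);
}

/* Average cost per call is then: elapsed_us(start_create, end_create) / niter */
[/codebox]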
When trying to analyze what is going on with oprofile, I noticed that when the duration is about 100 µs, 95% of the time is spent in libcudart. This gives some insight into why the bug only affects the runtime API and not the driver API (the cost of cuEventCreate is roughly constant at ~300 ns on this machine). Using performance counters would certainly make it possible to understand what's going on, but it looks as if the runtime keeps track of all events in some highly inefficient way (at least with respect to the role of cudaEventCreate, which should certainly not involve such expensive operations).
Am I the only one to observe such performance issues? (I'm using Linux and the 3.0-beta driver.) And if this is really a design issue, are there any guidelines on how to use events in a scalable way?
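In the meantime, the only workaround I can think of is to stop calling cudaEventCreate per operation and instead create a fixed pool of events up front and recycle them. This is just a sketch of that idea, not anything from the CUDA documentation; POOL_SIZE, pool_init and pool_get are names I made up, and the caller must make sure an event's previous use has completed before reusing it:

[codebox]
#include <cuda_runtime.h>

#define POOL_SIZE 64               /* illustrative pool size */

static cudaEvent_t pool[POOL_SIZE];
static unsigned    next_event;

/* Pay the cudaEventCreate cost once, at startup. */
void pool_init(void)
{
        unsigned i;
        for (i = 0; i < POOL_SIZE; i++)
                cudaEventCreate(&pool[i]);
}

/* Hand out events round-robin instead of creating a fresh one each time. */
cudaEvent_t pool_get(void)
{
        cudaEvent_t ev = pool[next_event];
        next_event = (next_event + 1) % POOL_SIZE;
        return ev; /* caller must not reuse it before the prior record/sync is done */
}
[/codebox]

Whether this is the intended usage pattern, I don't know; I'd be glad to hear if there is an official recommendation.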
Hoping somebody has more insight than I do…
Cédric
eventcreate.pdf (7.98 KB)