Performance bug: cost of cudaEventCreate grows linearly

Hi everybody,

While doing some microbenchmarking of asynchronous mechanisms implemented with CUDA, I have noticed a very strange behaviour: when cudaEventCreate is called many times, its cost becomes extremely high. When executing the following piece of code with different numbers of iterations, the average duration of cudaEventCreate grows to several hundred microseconds, which is just huge.

[codebox]

/* Measure the total cost of niter consecutive cudaEventCreate calls. */
cudaEvent_t *events = (cudaEvent_t *) calloc(niter, sizeof(cudaEvent_t));
struct timeval start_create, end_create;
cudaError_t status;
unsigned iter;

gettimeofday(&start_create, NULL);

for (iter = 0; iter < niter; iter++)
{
        status = cudaEventCreate(&events[iter]);
        assert(!status);
}

gettimeofday(&end_create, NULL);

[/codebox]
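
The average per-call cost can be derived from the two gettimeofday samples roughly like this (a sketch reusing start_create, end_create and niter from the snippet above):

[codebox]
/* Wall-clock time elapsed across the loop, in microseconds,
 * divided by the number of cudaEventCreate calls. */
double elapsed_us = (end_create.tv_sec  - start_create.tv_sec) * 1e6
                  + (end_create.tv_usec - start_create.tv_usec);

printf("average cudaEventCreate cost: %.3f us\n", elapsed_us / niter);
[/codebox]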

When trying to analyze what is going on with oprofile, I noticed that when the duration is around 100 µs, 95% of the time is spent in libcudart, which gives some insight into why this bug only affects the runtime API and not the driver API (the cost of cuEventCreate is pretty much constant at ~300 ns on this machine). Using performance counters would certainly make it possible to understand what is going on, but it looks like the runtime keeps track of all events in some highly inefficient way (at least with respect to the role of cudaEventCreate, which should certainly not involve such expensive operations).
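
For comparison, the driver API side of the benchmark looks roughly like this (a sketch reusing the same timing variables as above, and assuming a context has already been created with cuCtxCreate); this is the version where the per-call cost stays around 300 ns:

[codebox]
CUevent *cu_events = (CUevent *) calloc(niter, sizeof(CUevent));

gettimeofday(&start_create, NULL);

for (iter = 0; iter < niter; iter++)
{
        CUresult res = cuEventCreate(&cu_events[iter], CU_EVENT_DEFAULT);
        assert(res == CUDA_SUCCESS);
}

gettimeofday(&end_create, NULL);
[/codebox]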

Am I the only one observing such performance issues? (I'm using Linux and the 3.0-beta driver.) And if this is really a design issue, are there any guidelines on how to use events in a scalable way?

Hoping somebody has more insight into this than I do …

Cédric
eventcreate.pdf (7.98 KB)

Are you destroying the events? I could certainly see the cost growing linearly if you just keep allocating more and more of them.

Hi Gregory,

(I’ve been meaning to talk with you for a long time ^^)

Indeed, I am not deallocating them, which is not realistic, but I was still surprised that the allocation of such a resource would depend on the number of already allocated events. Here is another example of a workload where I keep only a bounded number of events allocated (at most “window”); in that case, the cost of an event creation looks proportional to the number of events currently allocated.

[codebox]

/* Sliding window: keep at most ~window events alive at any time. */
cudaEvent_t *events = (cudaEvent_t *) calloc(niter, sizeof(cudaEvent_t));
unsigned iter;

for (iter = 0; iter < niter; iter++)
{
        cudaEventCreate(&events[iter]);

        if (iter > window)
                cudaEventDestroy(events[iter - window]);
}

[/codebox]
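
To make the proportionality visible, the loop body can be instrumented to time each creation individually, roughly like this (same variables as above, printing the per-call cost against the iteration number):

[codebox]
/* Instrumented loop body: time each cudaEventCreate on its own. */
struct timeval t0, t1;

gettimeofday(&t0, NULL);
cudaEventCreate(&events[iter]);
gettimeofday(&t1, NULL);

double cost_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
printf("%u %f\n", iter, cost_us);

if (iter > window)
        cudaEventDestroy(events[iter - window]);
[/codebox]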

I had not really looked at the value of a “cudaEvent_t” once it is created: it is just a number (so that we have an API where we pass the events themselves and not pointers to the events). Could event creation be some kind of bitmap traversal to find a free event identifier? I should try some trick where I free only part of the events, just to see if that matches the behaviour of a bitmap :)
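
For what it is worth, the experiment I have in mind would look roughly like this (a sketch; nevents is an arbitrary count and events is assumed to be allocated as in the snippets above): fill the hypothetical internal table, free only every other slot to fragment it, and then time a fresh round of creations.

[codebox]
unsigned i;

/* Create a large batch of events... */
for (i = 0; i < nevents; i++)
        cudaEventCreate(&events[i]);

/* ...then destroy only every other one, leaving the table half full
 * but with free slots spread all over it. */
for (i = 0; i < nevents; i += 2)
        cudaEventDestroy(events[i]);

/* If creation scans a bitmap for a free identifier, these calls should
 * now be cheap (a free slot is found almost immediately), even though
 * half of the events are still alive. */
gettimeofday(&start_create, NULL);
for (i = 0; i < nevents; i += 2)
        cudaEventCreate(&events[i]);
gettimeofday(&end_create, NULL);
[/codebox]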

Now of course it is not a realistic approach to have thousands and thousands of events in flight at the same time, but to me, event creation looks like something that should be O(1) if we assume it is just an empty request that will be submitted to some stream afterwards.
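
In the meantime, the only scalable pattern I can think of is to pay the creation cost once and recycle a small pool of events instead of creating/destroying them on the fast path. A rough sketch of that idea, where POOL_SIZE and the round-robin indexing are arbitrary choices of mine:

[codebox]
#include <cuda_runtime.h>

#define POOL_SIZE 64    /* arbitrary, just for illustration */

static cudaEvent_t event_pool[POOL_SIZE];
static unsigned next_event = 0;

/* Pay the creation cost once, up front. */
void init_event_pool(void)
{
        unsigned i;
        for (i = 0; i < POOL_SIZE; i++)
                cudaEventCreate(&event_pool[i]);
}

/* Hand out events round-robin instead of creating fresh ones. */
cudaEvent_t get_event(void)
{
        cudaEvent_t ev = event_pool[next_event];
        next_event = (next_event + 1) % POOL_SIZE;
        return ev;
}
[/codebox]

Of course, an event can only be safely recycled once the work it was last recorded after has completed, so a real pool would have to check that (e.g. with cudaEventQuery) before handing an event out again.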

Thanks for your time,

Cédric