limit on the number of events linked to a stream via cudaStreamWaitEvent()?

hello,

well, is there a limit on the number of events one can make a stream wait on through cudaStreamWaitEvent()?

i assume multiple events - i.e. multiple cudaStreamWaitEvent() calls issued in sequence - is permitted

the guides are unclear about this

At any given moment in time, a stream can either be waiting on zero events or one event. Any other CUDA activities issued to a particular stream are simply queued up (and will become active when all previous CUDA activity issued to that stream is complete). These internal queues do have limits, which are not published as far as I know, but based on my own testing they are typically on the order of 1000-10000 pending CUDA calls. I have not tested with 1000 to 10000 discrete events, however.
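So issuing multiple cudaStreamWaitEvent() calls in sequence is perfectly legal. For example, something like this is fine (just a sketch - the work kernel and the stream/event names are placeholders, nothing special):

__global__ void work(){ }

int main(){
  cudaEvent_t evA, evB;
  cudaStream_t producer, consumer;
  cudaEventCreate(&evA);
  cudaEventCreate(&evB);
  cudaStreamCreate(&producer);
  cudaStreamCreate(&consumer);

  work<<<1,1,0,producer>>>();
  cudaEventRecord(evA, producer);   // marks completion of the first piece of work
  work<<<1,1,0,producer>>>();
  cudaEventRecord(evB, producer);   // marks completion of the second piece

  // both waits are legal; they queue up in the consumer stream and are
  // honored one at a time, in issue order
  cudaStreamWaitEvent(consumer, evA, 0);
  cudaStreamWaitEvent(consumer, evB, 0);
  work<<<1,1,0,consumer>>>();       // begins only after evA and evB have occurred

  cudaDeviceSynchronize();
  return 0;
}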

noted, thanks

i can do with more than one, but doubt whether i would exceed 10 or 100 at the most
thus, seems i am in luck

out of curiosity, how did you test?
i tried to think of a test, but failed to come up with one
how would you test - does the current cuda call start to fail or poll when the queue is full?

Yes, CUDA calls that would normally be non-blocking become blocking when the internal queue is “full”.

The method of determining this can be somewhat tricky, but if you are creating just a simple test case, it’s fairly easy to spot with a few choice printf statements - because the previously non-blocking call becomes blocking, so your code appears to “hang up” at that point - until queue slots free up.

Here’s a simple test. Not sure it’s exactly what you had in mind, and may not have the granularity you are looking for, but it demonstrates the “queue full effect”:

$ cat t762.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define KDELAY 10000000000ULL  // ~10 billion device clocks: roughly 10 seconds

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// spin kernel: busy-waits until roughly `delta` device clock ticks have elapsed
__global__ void mykernel(unsigned long long delta){

  unsigned long long mytime = clock64();
  while (clock64() < mytime+delta);
}

int main(int argc, char *argv[]){

  int num_events = 5;
  clock_t start = clock();
  if (argc == 2) {num_events = atoi(argv[1]);}
  cudaEvent_t *my_event = new cudaEvent_t[num_events];
  for (int i = 0; i < num_events; i++) cudaEventCreate(&(my_event[i]));
  cudaCheckErrors("event create error");
  cudaStream_t my_stream;
  cudaStreamCreate(&my_stream);
  cudaCheckErrors("stream create error");
  mykernel<<<1,1,0,my_stream>>>(KDELAY);  // long-running kernel keeps the stream busy
  for (int i = 0; i < num_events; i++) cudaEventRecord(my_event[i], my_stream);
  cudaCheckErrors("event record error");
  // normally non-blocking; becomes blocking if an internal queue fills up
  for (int i = 0; i < num_events; i++) cudaStreamWaitEvent(my_stream, my_event[i], 0);
  cudaCheckErrors("stream wait event error");
  clock_t end = clock();
  printf("got here after %f seconds\n", (end-start)/(float)CLOCKS_PER_SEC);
  cudaDeviceSynchronize();
  cudaCheckErrors("some error");
  clock_t end2 = clock();
  printf("finished after %f seconds\n", (end2-start)/(float)CLOCKS_PER_SEC);

  return 0;
}

$ nvcc -o t762 t762.cu
$ ./t762
got here after 0.454109 seconds
finished after 10.183502 seconds
$ ./t762 1000
got here after 0.468267 seconds
finished after 10.200411 seconds
$ ./t762 10000
got here after 10.254971 seconds
finished after 10.255035 seconds
$

There’s no significant difference in behavior when we ask for either 5 or 1000 events to be queued up. But when we ask for 10000 events to be queued up, somewhere along the way, a previously non-blocking CUDA call becomes blocking, because an internal queue is “full”. This full condition does not get relieved until the kernel actually completes, and so the “got here” printf statement that normally prints out shortly after program start does not print out until shortly before program end.

i have previously issued multiple such cudaStreamWaitEvent() for a stream

today, when doing it again, it dawned upon me that i might merely be assuming that it is indeed valid to do so - to have a stream wait on more than one event - given that events are involved, rather than ordinary stream work

i briefly thought that the cudaStreamWaitEvent() api may be elevated, as it links in and incorporates an event, but from your comments it seems that this is not the case - cudaStreamWaitEvent() is (seen as) an ordinary api call, without any special ‘privileges’ or requirements

are the queue depths you report per stream, or per device (for all streams)?

A stream cannot wait on more than one event at any given moment.

If you ask it to wait on event A, it will (at some point) wait on event A.
If you ask it to wait on event A, then immediately ask it to wait on event B, it will:

  1. at some point, begin to wait on event A (only)
  2. when event A is satisfied, begin waiting on event B (only)

If you wish to generalize the above behavior and say “I am having that stream wait on event A and event B”, then feel free to do so. But your questions seem to be more specific than that.
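To make that concrete, here's a minimal fan-in sketch (kernel and stream names are placeholders): s3 waits on A, then on B, one at a time, but the net effect is that s3's subsequent work begins only after both events have occurred.

__global__ void work(){ }

int main(){
  cudaStream_t s1, s2, s3;
  cudaEvent_t evA, evB;
  cudaStreamCreate(&s1); cudaStreamCreate(&s2); cudaStreamCreate(&s3);
  cudaEventCreate(&evA); cudaEventCreate(&evB);

  work<<<1,1,0,s1>>>();
  cudaEventRecord(evA, s1);          // A: completion of s1's work
  work<<<1,1,0,s2>>>();
  cudaEventRecord(evB, s2);          // B: completion of s2's work

  cudaStreamWaitEvent(s3, evA, 0);   // s3 waits on A first...
  cudaStreamWaitEvent(s3, evB, 0);   // ...then on B, one at a time
  work<<<1,1,0,s3>>>();              // net effect: begins only after both A and B

  cudaDeviceSynchronize();
  return 0;
}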

Mostly I’m reporting my own empirical data from which a logical conclusion might be drawn that there is a queue, and that it seems to have a “depth” along the lines of what I described. I don’t know the inner workings of the GPU to that level of detail (where exactly all the queues are, and what is the depth of each), and it’s not really my intent to provide unpublished data in this context. Occasionally, some other NV folks like Greg @ NV may come along and give this information - it’s their prerogative to do so. It’s not really my prerogative, and more importantly I don’t know these details anyway.

Queue depth numbers, if there are any, are unpublished. The granularity - per stream or per device - is, as far as I know, also unpublished. And since they are unpublished, I don't happen to know what they are.

An extension to the kind of test I created above might allow you to investigate whether there is per-stream or per-device association with any of the observations.
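For example (just a sketch of the idea - I haven't run this, and the counts are hypothetical): nearly fill one stream's queue with waits, then see whether a second stream is affected.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define KDELAY 10000000000ULL

__global__ void delay(unsigned long long d){
  unsigned long long t = clock64();
  while (clock64() < t + d);
}

int main(int argc, char *argv[]){
  int n1 = 5000, n2 = 5000;           // hypothetical counts; tune n1 so it alone does not block
  if (argc == 3) { n1 = atoi(argv[1]); n2 = atoi(argv[2]); }
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1); cudaStreamCreate(&s2);
  cudaEvent_t ev;
  cudaEventCreate(&ev);

  delay<<<1,1,0,s1>>>(KDELAY);        // long-running kernel keeps s1 occupied
  cudaEventRecord(ev, s1);

  clock_t t0 = clock();
  for (int i = 0; i < n1; i++) cudaStreamWaitEvent(s1, ev, 0);
  clock_t t1 = clock();
  for (int i = 0; i < n2; i++) cudaStreamWaitEvent(s2, ev, 0);
  clock_t t2 = clock();

  // if issuing to s2 blocks even though issuing the same count to s1 did not,
  // that would suggest a queue shared across streams; if s2 accepts a similar
  // number without blocking, per-stream queueing seems more likely
  printf("s1 issue: %f s, s2 issue: %f s\n",
         (t1-t0)/(float)CLOCKS_PER_SEC, (t2-t1)/(float)CLOCKS_PER_SEC);
  cudaDeviceSynchronize();
  return 0;
}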

i suppose the general/ordinary use case is simply to have a stream wait on an event:

cudaStreamWaitEvent(sx, ex, 0);

i use several streams for a number of reasons - for instance, to forward-issue work - and in some cases one or more streams then need to wait on more than one event, to preserve stream synchronization and prevent stream-induced races, for example:

cudaStreamWaitEvent(sx, e[0], 0);
cudaStreamWaitEvent(sx, e[1], 0);
cudaStreamWaitEvent(sx, e[2], 0);

or:

for (count = 0; count < y; count++)
{
    cudaStreamWaitEvent(sx, e[count], 0);
}

with multiple streams, i suppose it looks like:

for (county = 0; county < z; county++)
{
    for (count = 0; count < y; count++)
    {
        cudaStreamWaitEvent(s[county], e[count], 0);
    }
}

although this is a diluted and simplistic case

i do not mind that a stream only waits on one event at a time (the distinction is rather moot if the stream can only continue once all events are complete, which is the imposed condition), as long as all events are satisfied

i use multiple streams, but hardly more than 10 - 20
the number of events is normally a multiple of the number of streams used, but the multiple is small (5 at most)
the kernels, memory copies, etc issued per stream at any point in time are also limited
i am trying to confirm that i am still in the green in terms of the amount of work i issue across the streams, given the known queue (blocking) phenomenon and its likely depth - the point at which blocking occurs
i seem to be fine in this regard
and you are right, i should be able to extend your sample test and confirm this empirically