Get rid of busy waiting during asynchronous CUDA stream execution

I am looking for a way to get rid of busy waiting in the following code (do not copy this code, it only illustrates the idea of my problem and has many basic bugs):

cudaStream_t streams[S_N];

for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}

int sid = 0;

for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    while (true) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING !!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP,
                            cudaMemcpyHostToDevice, streams[sid]);
            kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
            break;
        }
        sid = (sid + 1) % S_N;
    }
}

Is there a way to idle the host thread and somehow wait for some stream to finish, and then prepare and run another stream?

I put the same question here

I have asked a similar question a couple of times. NVIDIA people simply ignore it for some reason.

I’m not sure what you’re looking for. As far as I know, there are 2 functions for querying a stream’s state:

1- cudaStreamQuery(my_stream) ==> returns immediately with either cudaSuccess if the stream has completed or cudaErrorNotReady if not
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__STREAM_ge78feba9080b59fe0fff536a42c13e6d.html#ge78feba9080b59fe0fff536a42c13e6d

2- cudaStreamSynchronize(my_stream) ==> blocks until the stream has completed
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__STREAM_geb3b2f88b7c1cff8b67a998a3a41c179.html#geb3b2f88b7c1cff8b67a998a3a41c179

If you want synchronization between 2 different streams, the way to do it is to record a cudaEvent on the first stream and then call cudaStreamWaitEvent with that event on your second stream.
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__STREAM_gfe68d207dc965685d92d3f03d77b0876.html#gfe68d207dc965685d92d3f03d77b0876
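A minimal sketch of that pattern (kernelA and kernelB are just placeholder kernels, reusing the launch variable names from your snippet):

cudaStream_t producer, consumer;
cudaEvent_t done;
cudaStreamCreate(&producer);
cudaStreamCreate(&consumer);
cudaEventCreate(&done);

// enqueue work on the first stream and record the event behind it
kernelA<<<gridDim, blockDim, 0, producer>>>(d_data);
cudaEventRecord(done, producer);

// the second stream waits on the device side (the host thread does not block)
// until everything recorded before 'done' has completed
cudaStreamWaitEvent(consumer, done, 0);
kernelB<<<gridDim, blockDim, 0, consumer>>>(d_data);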

hope this helps

I want to dynamically distribute jobs over the streams. When some stream becomes idle, i.e. has completed all of its commands, I want to schedule a new job (a list of commands such as a copy and a kernel execution) to that stream.

I did something like this:

cudaStream_t streams[S_N];

for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}

int sid = 0;

for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP,
                    cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}

for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}

But this distributes the chunks of work statically, and when one of the streams finishes its work it simply becomes idle. I tested it, and it appears that the busy-waiting code, which checks whether a stream has finished its work and schedules a new job for it, is more efficient than static distribution. But all I need is to get rid of the busy waiting.

The only solution that comes to my mind is to use one host thread per stream; that thread could invoke cudaStreamSynchronize on its particular stream and then schedule a new job for that stream. But I am not sure it is the only solution.

My idea to solve the problem is to have one host thread per stream. That host thread would invoke cudaStreamSynchronize to wait until the stream’s commands are completed. Unfortunately, this is not possible in CUDA 3.2, since it allows only one host thread to deal with one CUDA context, which means one host thread per CUDA-enabled GPU.

Hopefully it will be possible in CUDA 4.0: CUDA 4.0 RC news
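If that works out, here is a rough sketch of what I have in mind, assuming several host threads may share one context. The grabChunk helper, the worker function, and the per-stream d_data buffers are placeholder names of mine; streams, h_data, kernel, and the launch parameters are reused from my snippets above:

#include <pthread.h>
#include <stdint.h>
#include <cuda_runtime.h>

// shared chunk counter; each worker claims the next chunk until none remain
static int nextChunk = 0;
static pthread_mutex_t chunkLock = PTHREAD_MUTEX_INITIALIZER;

// hand out the next chunk offset, or -1 when all data has been claimed
static int grabChunk(void) {
    pthread_mutex_lock(&chunkLock);
    int d = nextChunk;
    nextChunk += DATA_STEP;
    pthread_mutex_unlock(&chunkLock);
    return d < DATA_SIZE ? d : -1;
}

// one worker per stream; each needs its own device buffer d_data[sid]
static void *streamWorker(void *arg) {
    int sid = (int)(intptr_t)arg;
    for (int d; (d = grabChunk()) >= 0; ) {
        cudaMemcpyAsync(d_data[sid], h_data + d, DATA_STEP,
                        cudaMemcpyHostToDevice, streams[sid]);
        kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data[sid], DATA_STEP);
        cudaStreamSynchronize(streams[sid]); // this thread waits here, no busy loop
    }
    return NULL;
}

// in main: launch one worker per stream, then join them all
pthread_t workers[S_N];
for (int i = 0; i < S_N; i++)
    pthread_create(&workers[i], NULL, streamWorker, (void *)(intptr_t)i);
for (int i = 0; i < S_N; i++)
    pthread_join(workers[i], NULL);

One caveat I am aware of: cudaStreamSynchronize may itself spin by default; as far as I know, calling cudaSetDeviceFlags(cudaDeviceBlockingSync) before the context is created makes the wait actually yield the CPU.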

kogut,

And what about the code that you published on Stack Overflow?

cudaStream_t streams[S_N];

for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}

cudaEvent_t event;
cudaEventCreate(&event);

for (int d = 0; d < DATA_SIZE; ) {
    bool wasFreeStream = false;

    for (int sid = 0; sid < S_N; sid++) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING !!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP,
                            cudaMemcpyHostToDevice, streams[sid]);
            kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
            cudaEventRecord(event, streams[sid]);
            wasFreeStream = true;
            break;
        }
    }

    if (wasFreeStream) {
        d += DATA_STEP;
    } else {
        cudaEventSynchronize(event);
    }
}

Is it wrong or what?

Yes, it is wrong; I have deleted it from Stack Overflow. I do not know if you have seen the comments on that answer. Since the single event is re-recorded on every launch, cudaEventSynchronize ends up waiting for the last scheduled kernel to finish its work, so I still do not know which of the streams finished first.
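For what it’s worth, the only busy-wait-free compromise I can see with a single host thread is to give up on “whichever stream finishes first” and simply block on the stream that is next up for reuse, round-robin (same names as in my snippets above):

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    // idle-wait until this stream's previous chunk is done before reusing it
    cudaStreamSynchronize(streams[sid]);
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP,
                    cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}

This is FIFO over the streams rather than truly dynamic, so one unlucky long chunk can still stall the queue, but there is no polling loop.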

Understood. It is very strange to me that this kind of task cannot be implemented without significant effort … and, what is more frustrating, none of the NVIDIA guys ever answer this sort of question. The whole idea of streams seems compromised if it is impossible (at least before CUDA 4.0) to dynamically feed a set of pre-created streams with work as soon as one of them becomes idle. It is also no pleasure to maintain a bunch of per-stream threads, especially when all the already-implemented CUDA-based code relies on the “one thread, one GPU” rule.