4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour.

kogut · March 8, 2011, 2:42pm

Hi,

I wrote code that uses many host (OpenMP) threads per GPU. Each thread has its own CUDA stream to order its requests. It looks very similar to the code below:

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (hasJob()) {
        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus = cudaStreamQuery(stream);
        if (streamStatus == cudaSuccess) {
            cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
            doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        } else {
            CUDA_CHECK(streamStatus);
        }

        // Wait for this thread's stream before preparing the next job.
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
}

Everything worked well until I had many small jobs. In that case, from time to time, cudaStreamQuery returns cudaErrorNotReady, which I did not expect because I call cudaStreamSynchronize. Until now I thought that cudaStreamQuery would always return cudaSuccess when called after cudaStreamSynchronize. Unfortunately, it appears that cudaStreamSynchronize may return even while cudaStreamQuery still returns cudaErrorNotReady.

I changed the code to the following, and everything works correctly.

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (hasJob()) {
        //... code to prepare job - dData, hData, dataSize etc

        // Keep synchronizing until the stream actually reports that it is idle.
        cudaError_t streamStatus;
        while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
            cudaStreamSynchronize(stream);
        }

        if (streamStatus == cudaSuccess) {
            cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
            doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        } else {
            CUDA_CHECK(streamStatus);
        }

        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
}

So my question… is it a bug or a feature?

EDIT: I asked the same question on Stack Overflow: "CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour".

seibert · March 8, 2011, 3:57pm

This does look pretty weird, and seems like it has to be a bug.

One other option you could investigate is whether you can solve this by recording a cudaEvent in the stream after the kernel launch, and then calling cudaEventSynchronize on that event.
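A minimal sketch of that event-based approach, for illustration only. The kernel body, the `int` element type, and the helper name `runJobWithEvent` are assumptions, not taken from the original program:

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's doTheJob kernel: increments each element.
__global__ void doTheJob(int *data, size_t count) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1;
}

// Enqueues one job into `stream`, records an event behind it, and waits on
// that event rather than on the stream. Returns the first CUDA error seen.
cudaError_t runJobWithEvent(cudaStream_t stream, int *dData,
                            const int *hData, size_t count) {
    assert(count > 0);  // a zero-sized grid would be an invalid launch

    cudaEvent_t done;
    cudaError_t err = cudaEventCreate(&done);
    if (err != cudaSuccess) return err;

    err = cudaMemcpyAsync(dData, hData, count * sizeof(int),
                          cudaMemcpyHostToDevice, stream);
    if (err == cudaSuccess) {
        unsigned threads = 256;
        unsigned blocks = (unsigned)((count + threads - 1) / threads);
        doTheJob<<<blocks, threads, 0, stream>>>(dData, count);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    // The event completes only after all work enqueued before it in the stream.
    if (err == cudaSuccess) err = cudaEventRecord(done, stream);
    if (err == cudaSuccess) err = cudaEventSynchronize(done);

    cudaEventDestroy(done);
    return err;
}
```

Each host thread would call `runJobWithEvent` with its own stream in place of the copy, launch, and cudaStreamSynchronize sequence. Creating the event once per thread and reusing it would avoid the per-job create and destroy calls.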

kogut · March 8, 2011, 4:12pm

I tried cudaEvent: after cudaEventSynchronize, cudaStreamQuery always returns cudaSuccess, as expected. So it has to be a bug, unfortunately.

tmurray · March 8, 2011, 5:09pm

Can you post a full repro?

kogut · March 9, 2011, 12:48pm

Here is my source code: nvidia_forum.tgz (download on 2shared)

I am running it on 64-bit Linux (openSUSE). I also included the output of the CUDA SDK deviceQuery and of /proc/cpuinfo.

To compile it you need CMake 2.8 or later and an OpenMP library.

To compile it, type:

    cd nvidia_forum/histogram/
    ./bootstrap
    make

To run it and see the described behaviour, type:

    ./build/histogram -d 10000 -a 9
    ./build/histogram -d 1000000 -a 9

If you have any problems reproducing the bug, or you expected something different, just tell me.

Thanks.

Christopher_Cameron · March 10, 2011, 12:47am

Yes, this is a new bug. In particular, if you call cudaStreamQuery() on a cudaStream_t before any work is enqueued into it, you may get an unexpected cudaErrorNotReady (this is reflecting internal state). Similarly, if you call cudaStreamSynchronize(), the call may not return immediately (it may wait for a little while before returning). Once any work has been enqueued into the stream (a kernel, memcpy, or event record), you will get the expected results from cudaStreamQuery()/cudaStreamSynchronize() for the remainder of the cudaStream_t’s lifetime.
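A small probe for the condition described above might look like the following sketch. The function name `queryFreshStream` is hypothetical. On an affected 4.0 RC runtime the returned status may be cudaErrorNotReady; on a runtime without the bug, an empty stream is expected to report cudaSuccess.

```cuda
#include <cuda_runtime.h>

// Creates a stream and queries it before any work has been enqueued.
// Returns the status reported by cudaStreamQuery (or the creation error).
cudaError_t queryFreshStream() {
    cudaStream_t stream;
    cudaError_t err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) return err;

    // Nothing has been enqueued yet, so the stream should be idle.
    cudaError_t status = cudaStreamQuery(stream);

    cudaStreamDestroy(stream);
    return status;
}
```

Passing the returned value to cudaGetErrorString gives a readable message for logging.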

This bug should be fixed in the next RC. Until then, you can work around this either by calling cudaStreamSynchronize() immediately after calling cudaStreamCreate() (at the expense of some performance), or by just ignoring the result of cudaStreamQuery() until you know that you have enqueued work (a kernel, memcpy, or event record) into the stream (somewhat more complicated, but with no performance impact).
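The two workarounds could be sketched roughly as follows. The `TrackedStream` struct and the helper names are hypothetical, introduced only to illustrate the idea, and the flag is assumed to be touched by a single host thread (one stream per thread, as in the original code):

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: a stream plus a flag that records whether the owning
// host thread has enqueued any work into it yet.
struct TrackedStream {
    cudaStream_t stream;
    bool hasWork;
};

// Workaround 1: synchronize once immediately after creating the stream,
// paying the possible wait up front.
cudaError_t createAndSettle(cudaStream_t *stream) {
    cudaError_t err = cudaStreamCreate(stream);
    if (err != cudaSuccess) return err;
    return cudaStreamSynchronize(*stream);
}

// Workaround 2: skip the query until work has been enqueued; a stream that
// was never used is treated as idle.
cudaError_t queryIfUsed(const TrackedStream &ts) {
    return ts.hasWork ? cudaStreamQuery(ts.stream) : cudaSuccess;
}

// Enqueues a host-to-device copy and marks the stream as used on success.
cudaError_t enqueueCopy(TrackedStream &ts, void *dst, const void *src,
                        size_t bytes) {
    cudaError_t err = cudaMemcpyAsync(dst, src, bytes,
                                      cudaMemcpyHostToDevice, ts.stream);
    if (err == cudaSuccess) ts.hasWork = true;
    return err;
}
```

In the loop from the first post, `queryIfUsed` would take the place of the bare cudaStreamQuery call, and the kernel launch would set `hasWork` in the same way as `enqueueCopy`.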
