4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour.

Hi,

I wrote a code which uses many host (openMP) threads per one GPU. Each thread has its own CUDA stream to order it requests. It looks very similar to below code:

#pragma omp parallel for num_threads(STREAM_NUMBER)

for (int sid = 0; sid < STREAM_NUMBER; sid++) {

    cudaStream_t stream;

    cudaStreamCreate(&stream);

while (hasJob()) {

//... code to prepare job - dData, hData, dataSize etc

cudaError_t streamStatus = cudaStreamQuery(stream);

        if (streamStatus == cudaSuccess) {

             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);

             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);

        else {

             CUDA_CHECK(streamStatus);

        }

        cudaStreamSynchronize();

    }

    cudaStreamDestroy(stream);

}

And everything were good till I got many small jobs. In that case, from time to time, cudaStreamQuery returns cudaErrorNotReady, which is for me unexpected because I use cudaStreamSynchronize. Till now I were thinking that cudaStreamQuery will always return cudaSuccess if it is called after cudaStreamSynchronize. Unfortunately it appeared that cudaStreamSynchronize may finish even when cudaStreamQuery still returns cudaErrorNotReady.

I changed the code into fallowing and everything works correctly.

#pragma omp parallel for num_threads(STREAM_NUMBER)

for (int sid = 0; sid < STREAM_NUMBER; sid++) {

    cudaStream_t stream;

    cudaStreamCreate(&stream);

while (hasJob()) {

//... code to prepare job - dData, hData, dataSize etc

cudaError_t streamStatus;

        while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {

             cudaStreamSynchronize();

        }

        if (streamStatus == cudaSuccess) {

             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);

             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);

        else {

             CUDA_CHECK(streamStatus);

        }

        cudaStreamSynchronize();

    }

    cudaStreamDestroy(stream);

}

So my question… is it a bug or a feature?

EDIT: I asked the same question here: openmp - CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour - Stack Overflow

This does look pretty weird, and seems like it has to be a bug.

One other option you could investigate is whether you could solve this by adding cudaEvent to the stream after the kernel launch, and then cudaEventSynchronize on that event.

I checked cudaEvent, and after cudaEventSynchronize synchronization cudaStreamQuery allways returns true, as it is expected. So it has to be bug, unfortunately.

Can you post a full repro?

Here you have my source code: nvidia_forum.tgz download - 2shared

I am running it on Linux 64 - open SUSE. Additionally, I added output files from CUDA SDK deviceQuery and /proc/cpuinfo.

To compile it you need to have installed cmake in version at least 2.8 and openMP library.

To compile it, type:

cd nvidia_forum/histogram/

./bootstrap

make

To run it, and see the described behaviour, type:

./build/histogram -d 10000 -a 9

./build/histogram -d 1000000 -a 9

If you have any problem with reproducing the bug or you have expected something different, just tell me.

Thanks.

Yes, this is a new bug. In particular, if you call cudaStreamQuery() on a cudaStream_t before any work is enqueued into it, you may get an unexpected cudaErrorNotYetReady (this is reflecting internal state). Similarly, if you call cudaStreamSynchronize(), the call may not return immediately (it may wait for a little while before returning). Once any work has been enqueued into the stream (a kernel, memcpy, or event record), you will get the expected results from cudaStreamQuery()/cudaStreamSynchronize() for the remainder of the cudaStream_t’s lifetime.

This bug should be fixed in the next RC. Until then, you can work around this either by calling cudaStreamSynchronize() immediately after calling cudaStreamCreate() (at the expense of some performance), or by just ignoring the result of cudaStreamQuery() until you know that you have enqueued work (a kernel, memcpy, or event record) into the stream (somewhat more complicated, but with no performance impact).