Adding CUDA streams to threaded software


I have an application that I wrote using CUDA 3.2 to perform some calculations. I did NOT use any asynchronous API calls or anything of the sort in that CUDA app, just normal memcpy and sequential kernel execution.

To achieve better performance, in my host code I have spawned multiple pthreads and created a new context in each of those worker threads. The problem is that each context takes up a lot of space, so I have been investigating using streams instead.

What is the best strategy here? If I go back through my CUDA code and simply add a stream parameter to all the CUDA memory operations and kernel launches, so that the class instance containing my CUDA code operates entirely on the SAME stream, will it basically run sequentially, as I have it written now?

Would my worker threads be OK all pushing the same CUDA context? If they all pushed the same context, and each passed a different stream to the CUDA class I have modified to use that stream, would that give me concurrency among the worker threads?

Thanks for any tips or resources that you can point me to.

The examples in the SDK seem to be lacking details on some of this stuff.

  1. Install CUDA 4.0, which makes the whole API thread-safe and enables sharing of contexts transparently between threads.
  2. Use the same context in every thread (or cudaSetDevice(sameDeviceID) in every thread, which gets you the same context as of 4.0).
  3. Use a different stream in every thread.
  4. Use cudaStreamWaitEvent/cuStreamWaitEvent to synchronize between streams.
  5. ???
  6. Profit.
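A minimal sketch of that recipe using the runtime API might look like the following. The kernel name `myKernel`, the block size, and the buffer handling are placeholders, not anything from this thread; error checking is omitted for brevity.

```cpp
// Per-thread stream recipe (CUDA 4.0+ runtime API) - a sketch, not a
// drop-in implementation. Each worker thread runs this function.
#include <cuda_runtime.h>

__global__ void myKernel(const float* in, float* out, size_t n);  // placeholder

void workerThread(const float* hostIn, float* hostOut, size_t n)
{
    cudaSetDevice(0);                    // same device => same context as of 4.0

    cudaStream_t stream;
    cudaStreamCreate(&stream);           // a private stream for this thread

    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    // Queue everything on this thread's own stream so work from
    // different threads can overlap on the device.
    cudaMemcpyAsync(dIn, hostIn, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(dIn, dOut, n);
    cudaMemcpyAsync(hostOut, dOut, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);       // wait only for this thread's work

    cudaFree(dIn);
    cudaFree(dOut);
    cudaStreamDestroy(stream);
}
```

Note that `cudaMemcpyAsync` only actually overlaps with other work when the host buffers are page-locked (allocated with `cudaHostAlloc`/`cudaMallocHost`); with pageable memory the copies fall back to synchronous behavior.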

Related to 3) Does this imply that I will have to modify my CUDA-enabled library to have a stream parameter for all the memory operations and kernel invocations?

Related to 4) All my invocations of my CUDA-enabled library should be independent of each other. I would expect each call to “the library” to run on the same stream, so do I need to synchronize anything? It seems like, at present, all of the CUDA in my library is sequential within the default stream, right?

  1. Assuming every thread contains a logical series of operations that are independent from one another, then yes, you would need to do that. The default stream in 4.0 still causes process-wide serialization, so you need to manually manage streams.

  2. If the library calls are truly independent from one another, there’s no shared state between the streams, etc., then no, you don’t need to do any sort of synchronization with StreamWaitEvent. You would just probably want to make sure that you synchronize with the CPU using cudaStreamSynchronize or something like that instead of cudaDeviceSynchronize.
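For the case where two streams ever do need an ordering dependency, the event-based pattern from point 4 above looks roughly like this (a sketch; `streamA` and `streamB` are assumed to already exist):

```cpp
// Make streamB wait for a point in streamA without blocking the CPU.
// Sketch only: streamA/streamB are placeholder streams created elsewhere.
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

// ... enqueue producer work on streamA ...
cudaEventRecord(done, streamA);           // mark the point to wait for
cudaStreamWaitEvent(streamB, done, 0);    // streamB stalls until `done` fires
// ... enqueue consumer work on streamB ...

cudaEventDestroy(done);
```

The wait happens entirely on the device, which is why it is preferable to having the CPU call `cudaDeviceSynchronize` and serialize everything.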

Thanks for the input so far.

Can you elaborate on how the worker threads come to know their context? I was under the impression that with CUDA 4.0 I could simply create a new stream in that thread, and it would automatically know the context.

In reality, do I still need to pop the context from the main thread, and “push” the context from each of the worker threads?

Does this need to happen once per worker thread, or do I need to pass the context around like a token to each thread that needs it? If so, that kind of defeats the purpose of what I am trying to get done, and I’d be better off just creating a new context in each worker thread.

I tried using “cudaSetDevice(0)” to select the same device as the main host thread. I followed that with cuCtxGetCurrent(), which gives me a context, but that context IS different from the original context on the main host thread.

I am having a similar problem. I have everything set up as you describe: each thread has its own stream, function, and input/output buffers (but not a CUmodule; that is shared between threads).

At the start of the thread function I call cuCtxPushCurrent to make the CUDA context current for that thread. All the CUDA functions that have asynchronous variants (cuMemcpy and cuLaunchGrid) are called in asynchronous mode, passing in the current thread’s stream. Each thread then waits on its stream using cuStreamSynchronize before processing the data that was returned.
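The rough shape of that per-thread driver-API flow, as I understand it (names and signatures here are placeholders; error checking omitted):

```cpp
// Sketch of the per-thread driver-API flow described above.
// ctx, func, stream, and the buffers are created elsewhere and passed in.
#include <cuda.h>

void threadFunc(CUcontext ctx, CUfunction func, CUstream stream,
                CUdeviceptr dIn, const void* hIn, size_t bytes)
{
    cuCtxPushCurrent(ctx);                      // make the shared context current

    cuMemcpyHtoDAsync(dIn, hIn, bytes, stream); // queue copy on this thread's stream
    // ... set parameters and launch `func` on `stream` ...
    cuStreamSynchronize(stream);                // wait for this stream only

    cuCtxPopCurrent(NULL);                      // detach before the thread exits
}
```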

Everything seems to work, except for occasional problems where the data I’m getting back appears to be invalid; these go away if I wrap all my CUDA calls in a critical section.

My concern is with the functions that don’t take a CUstream parameter, such as cuParamSet and cuFuncSetBlockShape: are these thread safe? Is there a description of the CUDA threading policy somewhere in the docs? I can’t seem to find any reference to which functions are, and are not, thread safe.

Thanks all

This was indeed the issue. I couldn’t find anything in the main docs, but there was a little snippet about it in the release notes for 4.0.

If I convert from cuLaunchGridAsync, cuParamSet, cuFuncSetBlockShape, etc. to cuLaunchKernel, then my threading issues go away.
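That conversion makes sense: the old launch API stores the parameters and block shape as state on the CUfunction itself, so threads sharing one CUfunction race on it, while cuLaunchKernel passes everything in the call. A sketch of the difference (grid/block dimensions and the argument list are placeholders):

```cpp
// Old pattern - per-CUfunction state, NOT safe when threads share `func`:
//   cuFuncSetBlockShape(func, 256, 1, 1);
//   cuParamSetSize(func, paramBytes);
//   cuLaunchGridAsync(func, gridW, 1, stream);

// New pattern - all launch state travels in the single call, so threads
// sharing one CUfunction no longer trample each other's setup:
CUdeviceptr dIn, dOut;          // placeholders, allocated elsewhere
int n = 1024;                   // placeholder element count
void* args[] = { &dIn, &dOut, &n };
cuLaunchKernel(func,
               gridW, 1, 1,     // grid dimensions
               256, 1, 1,       // block dimensions
               0,               // dynamic shared memory bytes
               stream,          // this thread's stream
               args, NULL);     // kernel parameters, no extra options
```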

Interesting. I may look into this for my code as well. I haven’t changed it yet, but I have the same symptoms as you, even using different contexts (not worrying about streams). My kernels are launched asynchronously from the main thread, and I occasionally end up with bad results.