GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Originally published at:

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA Applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with…

The code samples use tid = threadIdx.x + blockIdx.x + blockDim.x instead of the more usual tid = threadIdx.x + blockIdx.x * blockDim.x. It seems that if more than one block were used, the tids would not be unique. Is this just a typo, or if it is intentional, can you explain the choice?

Who can explain the 4th tip:
You can create non-blocking streams which do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreate().

I think this is deliberate to prevent bank conflicts. Have a look at the loop increment section where i is incremented by

i += blockDim.x * gridDim.x

No, it was a typo, thanks for catching it Dan! I fixed the code (and verified the profiling results are the same). The increment is that way because this is a grid-stride loop (

I believe it should, since underneath those are just threads. I haven't tested it yet though (since clang on my macbook doesn't support OpenMP -- that was going to be my example initially. :)

non-blocking streams simply don't synchronize implicitly with the legacy default stream -- they have the opposite behavior to the default (legacy) behavior. This is explained in the docs:

The 6.5 programming guide states that a device memory allocation will serialize commands in different streams, yet you have one in your parallel function. Are you getting lucky that *all* calls to cudaMalloc are invoked before *any* concurrent kernel launch, or has this restriction been removed?

Good observation. I may indeed be getting lucky in this example -- however it's also pretty straightforward to make sure all allocations are done ahead of time (especially in the single thread, multi-stream case). And if you need higher performance and control over blocking, you could write a suballocator (multithreaded or otherwise) on top of a single large device memory allocation.

Hey is it possible to launch 2 different kernels on 2 devices concurrently from one CPU and how we can do it

Does this work also with NPP library? In this case how would we set the stream for NPP in particular host thread? nppSetStream(0) ?

Enabling PTDS for your compilation units doesn't enable it for libraries that are separately compiled (like NPP). So I think you need to call nppSetStream(cudaPerThreadStream) to make NPP use the per-thread default stream.

Sure, you can either do that using explicit streams in a single thread or per-thread default stream with multiple threads. On each stream/thread, call cudaSetDevice(x) and then launch the kernel for stream x (where x is a different device for each stream/thread).

Thanks, I have confirmed that nppSetStream(0) indeed does not work. How can I get cudaPerThreadStream? If I simply put nppSetStream(cudaPerThreadStream) I get a compile error "identifier "cudaPerThreadStream" is undefined"...

I had a typo. Try cudaStreamPerThread?

This works, thanks!

Excellent, glad to help.

Hello Mark,

Using the nvvp I noticed that even if I set Thrust to do a transform using a non-default stream, it will still show in the profile as if it is executing in the the default stream. Can you give me some advice in this respect?

/* Example: Using Thrust to convert from 8-bit to double using stream */
struct convert_byte_to_double : public thrust::unary_function<char, double="">
__host__ __device__
double operator()(const char& byte_value) {
return (double) byte_value;

thrust::device_ptr<double> double_devptr =
thrust::device_ptr<char> byte_devptr =

thrust::transform(thrust::cuda::par.on(*stream), byte_devptr,
byte_devptr + length, double_devptr, convert_byte_to_double());

In the profiler I will see the following kernel name associated with the default stream:

"void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int="0," thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned="" long="1">, unsigned long=0>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::for_each_n_detail::for_each_kernel, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned="" int="0">, thrust::zip_iterator<thrust::tuple<thrust::device_ptr<char>, thrust::device_ptr<double>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::wrapped_function<thrust::detail::unary_transform_functor<convert_byte_to_real>, void>, unsigned int, thrust::null_type, thrust::null_type, thrust::nul"


Hi Omar, that's not quite a complete sample, so I wouldn't be able to repro. How is "stream" defined in your program?

Hi Mark, the CUDA streams are run in separate OpenMP threads. To create the CUDA streams I use the following:

cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);

I wrote a complete example below. Thank you for your time.

#include <thrust transform.h="">
#include <thrust device_vector.h="">
#include <thrust device_ptr.h="">
#include <thrust copy.h="">
#include <thrust system="" cuda="" execution_policy.h="">
#include <omp.h>
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */

#include <stdio.h>
#include <fstream>
#include <cfloat> /* DBL_MIN */

void testThrustStreams();

int main() {
return 0;

struct convert_byte_to_double: public thrust::unary_function<char, double=""> {
__host__ __device__
double operator()(const char& byte_value) {
return (double) byte_value;

void testThrustStreams() {

int size = 10;
int num_streams = 5;

char *byte_host[num_streams];
char *byte_dev[num_streams];
double *double_dev[num_streams];
double max_value[num_streams];


for (unsigned int i = 0; i < num_streams; i++) {
byte_host[i] = (char *) malloc(sizeof(char) * size);
for (int j = 0; j < size; j++) {
byte_host[i][j] = rand() % 255;
cudaMalloc(&byte_dev[i], size * sizeof(char));
cudaMemcpy(byte_dev[i], byte_host[i], size * sizeof(char),

/* CUDA streams and output buffers */
cudaStream_t stream[num_streams];
for (unsigned int i = 0; i < num_streams; i++) {
cudaMalloc(&double_dev[i], size * sizeof(double));
cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);

#pragma omp parallel num_threads(num_streams)

int tid = omp_get_thread_num();

thrust::device_ptr<char> byte_devptr = thrust::device_pointer_cast(
thrust::device_ptr<double> double_devptr = thrust::device_pointer_cast(

thrust::transform(thrust::cuda::par.on(stream[tid]), byte_devptr,
byte_devptr + size, double_devptr, convert_byte_to_double());

max_value[tid] = thrust::reduce(thrust::cuda::par.on(stream[tid]),
double_devptr, double_devptr + size, DBL_MIN,


#pragma omp barrier

for (int i = 0; i < num_streams; i++) {
std::cout << i << " => max: " << max_value[i] << std::endl;

/** Cleanup memory **/
for (int i = 0; i < num_streams; i++) {