GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

jwitsoe · January 23, 2015, 3:47am

Originally published at: https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA applications manage concurrency by executing asynchronous commands in streams, which are sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with…

anon93675426 · January 23, 2015, 5:12pm

The code samples use tid = threadIdx.x + blockIdx.x + blockDim.x instead of the more usual tid = threadIdx.x + blockIdx.x * blockDim.x. It seems that if more than one block were used, the tids would not be unique. Is this just a typo, or if it is intentional, can you explain the choice?

anon51415268 · January 24, 2015, 9:43am

Can anyone explain the 4th tip?
You can create non-blocking streams which do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreate().

anon31870422 · January 24, 2015, 4:09pm

I think this is deliberate to prevent bank conflicts. Have a look at the loop increment section where i is incremented by

i += blockDim.x * gridDim.x

anon95180265 · January 27, 2015, 10:24pm

No, it was a typo, thanks for catching it Dan! I fixed the code (and verified the profiling results are the same). The increment is that way because this is a grid-stride loop (http://devblogs.nvidia.com/...
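For readers following this exchange, here is a minimal sketch of a grid-stride loop with the corrected index calculation. The kernel name `scale` and the host helper `scaleOnDevice` are illustrative and do not come from the blog post; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread starts at its unique global index and then
// advances by the total number of threads in the grid.
__global__ void scale(float *x, int n, float a)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // unique across blocks
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = a * x[i];
    }
}

// Hypothetical host helper: copies the data to the device, scales it there,
// and copies the result back.
std::vector<float> scaleOnDevice(std::vector<float> h, float a)
{
    const int n = static_cast<int>(h.size());
    if (n == 0) return h;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<4, 64>>>(d, n, a);  // deliberately fewer threads than elements
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With the `+` typo, threads in different blocks can compute the same starting index, so some elements would be processed more than once and others not at all. The stride `blockDim.x * gridDim.x` is what lets a small grid cover an array of any length.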

anon95180265 · January 27, 2015, 10:26pm

I believe it should, since underneath those are just threads. I haven't tested it yet, though, since clang on my MacBook doesn't support OpenMP (that was going to be my example initially). :)

anon95180265 · January 27, 2015, 10:28pm

Non-blocking streams simply don't synchronize implicitly with the legacy default stream -- that is the opposite of the default (legacy) behavior. This is explained in the docs: http://docs.nvidia.com/cuda...
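A minimal sketch of creating such a stream and checking its flag follows. In the CUDA runtime API the `cudaStreamNonBlocking` flag is passed to `cudaStreamCreateWithFlags()`, since `cudaStreamCreate()` takes no flags argument. The helper names `makeNonBlockingStream` and `isNonBlocking` are illustrative only.

```cuda
#include <cuda_runtime.h>

// Creates a stream that does not synchronize implicitly with the legacy
// default stream. Returns nullptr if creation fails.
cudaStream_t makeNonBlockingStream()
{
    cudaStream_t s = nullptr;
    if (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) != cudaSuccess) {
        return nullptr;
    }
    return s;
}

// Reports whether a stream was created with cudaStreamNonBlocking.
bool isNonBlocking(cudaStream_t s)
{
    unsigned int flags = 0;
    if (cudaStreamGetFlags(s, &flags) != cudaSuccess) {
        return false;
    }
    return (flags & cudaStreamNonBlocking) != 0;
}
```

A stream made with plain `cudaStreamCreate()` would report `false` here, which is the blocking behavior described above.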

anon10013788 · February 17, 2015, 8:43pm

The 6.5 programming guide states that a device memory allocation will serialize commands in different streams, yet you have one in your parallel function. Are you getting lucky that *all* calls to cudaMalloc are invoked before *any* concurrent kernel launch, or has this restriction been removed?

anon95180265 · February 19, 2015, 2:05am

Good observation. I may indeed be getting lucky in this example -- however it's also pretty straightforward to make sure all allocations are done ahead of time (especially in the single thread, multi-stream case). And if you need higher performance and control over blocking, you could write a suballocator (multithreaded or otherwise) on top of a single large device memory allocation.
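One way to picture the suballocator idea is a simple bump (arena) allocator over a single `cudaMalloc`. This is only a sketch under stated assumptions: the class name `DeviceArena`, the 256-byte alignment, and the reset-only freeing policy are all choices made for illustration, and the class is not thread-safe.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical bump suballocator: one cudaMalloc up front, after which
// allocation is only pointer arithmetic and never calls into the driver.
class DeviceArena {
public:
    explicit DeviceArena(size_t bytes) : base_(nullptr), size_(0), offset_(0) {
        if (cudaMalloc(&base_, bytes) == cudaSuccess) size_ = bytes;
    }
    ~DeviceArena() { cudaFree(base_); }
    DeviceArena(const DeviceArena &) = delete;
    DeviceArena &operator=(const DeviceArena &) = delete;

    // Returns a 256-byte-aligned device pointer, or nullptr if the arena
    // does not have enough space left.
    void *allocate(size_t bytes) {
        size_t start = (offset_ + kAlign - 1) / kAlign * kAlign;
        if (start > size_ || bytes > size_ - start) return nullptr;
        offset_ = start + bytes;
        return static_cast<char *>(base_) + start;
    }

    // Releases every suballocation at once; individual frees are not supported.
    void reset() { offset_ = 0; }

private:
    static constexpr size_t kAlign = 256;
    void *base_;
    size_t size_;
    size_t offset_;
};
```

Because `allocate` never calls `cudaMalloc`, handing out buffers while kernels are running in other streams does not involve a device memory allocation. A multithreaded version would need a lock or an atomic offset.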

anon87407878 · May 2, 2015, 9:03am

Hey, is it possible to launch 2 different kernels on 2 devices concurrently from one CPU, and if so, how can we do it?

anon65602223 · May 20, 2015, 2:39pm

Does this also work with the NPP library? In that case, how would we set the stream for NPP in a particular host thread? nppSetStream(0)?

anon95180265 · May 21, 2015, 10:06pm

Enabling PTDS for your compilation units doesn't enable it for libraries that are separately compiled (like NPP). So I think you need to call nppSetStream(cudaPerThreadStream) to make NPP use the per-thread default stream.

anon95180265 · May 21, 2015, 10:08pm

Sure, you can either do that using explicit streams in a single thread or per-thread default stream with multiple threads. On each stream/thread, call cudaSetDevice(x) and then launch the kernel for stream x (where x is a different device for each stream/thread).
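The single-thread, explicit-stream variant might look roughly like the sketch below. The kernel `fill` and the helper `launchOnEachDevice` are hypothetical names, error checking is omitted, and the code simply uses however many devices `cudaGetDeviceCount` reports.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void fill(int *out, int n, int value)
{
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = value;
    }
}

// Hypothetical helper: from one host thread, launches one kernel per device
// and returns the first element each device wrote.
std::vector<int> launchOnEachDevice(int n)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    std::vector<cudaStream_t> streams(count);
    std::vector<int *> buffers(count, nullptr);

    // Launch phase: kernel launches return immediately, so the host moves
    // on to the next device without waiting for the previous kernel.
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);               // later calls target this device
        cudaStreamCreate(&streams[dev]);  // stream belongs to the current device
        cudaMalloc(&buffers[dev], n * sizeof(int));
        fill<<<32, 128, 0, streams[dev]>>>(buffers[dev], n, dev + 1);
    }

    // Collection phase: wait for each device, read back, and clean up.
    std::vector<int> first(count, 0);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(streams[dev]);
        cudaMemcpy(&first[dev], buffers[dev], sizeof(int),
                   cudaMemcpyDeviceToHost);
        cudaFree(buffers[dev]);
        cudaStreamDestroy(streams[dev]);
    }
    return first;
}
```

Here the same kernel is launched with a different argument on each device; two different kernels would be launched the same way. The important detail is calling `cudaSetDevice` before creating the stream and before launching into it, because a stream is tied to the device that was current when it was created.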

anon65602223 · May 22, 2015, 6:16am

Thanks, I have confirmed that nppSetStream(0) indeed does not work. How can I get cudaPerThreadStream? If I simply put nppSetStream(cudaPerThreadStream) I get a compile error "identifier "cudaPerThreadStream" is undefined"...

anon95180265 · May 24, 2015, 11:55pm

I had a typo. Try cudaStreamPerThread?
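As a sketch of what the `cudaStreamPerThread` handle refers to, the example below has several host threads each launch a kernel into their own per-thread default stream by naming the handle explicitly. The names `addOne`, `worker`, and `runWorkers` are illustrative. The same handle is what would be passed to a separately compiled library that accepts a `cudaStream_t`.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void addOne(int *x, int n)
{
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += blockDim.x * gridDim.x) {
        x[i] += 1;
    }
}

// Runs in each host thread. cudaStreamPerThread resolves to the calling
// thread's own default stream, so the threads do not share a stream.
void worker(int *d, int n)
{
    addOne<<<8, 128, 0, cudaStreamPerThread>>>(d, n);
    cudaStreamSynchronize(cudaStreamPerThread);
}

// Hypothetical driver: one zeroed buffer per thread; returns the last
// element of each buffer after the workers finish.
std::vector<int> runWorkers(int numThreads, int n)
{
    std::vector<int *> bufs(numThreads, nullptr);
    for (int *&b : bufs) {
        cudaMalloc(&b, n * sizeof(int));
        cudaMemset(b, 0, n * sizeof(int));
    }
    cudaDeviceSynchronize();  // make sure the memsets have completed

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back(worker, bufs[t], n);
    }
    for (std::thread &t : threads) t.join();

    std::vector<int> last(numThreads, 0);
    for (int t = 0; t < numThreads; ++t) {
        cudaMemcpy(&last[t], bufs[t] + (n - 1), sizeof(int),
                   cudaMemcpyDeviceToHost);
        cudaFree(bufs[t]);
    }
    return last;
}
```

Naming the handle explicitly works whether or not the file is compiled with per-thread default streams enabled, which is why it also suits libraries that were built separately.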

anon65602223 · May 25, 2015, 10:36am

This works, thanks!

anon95180265 · May 26, 2015, 1:25am

Excellent, glad to help.

anon84996908 · May 29, 2015, 2:37pm

Hello Mark,

Using nvvp, I noticed that even if I set Thrust to do a transform using a non-default stream, it still shows up in the profile as if it were executing in the default stream. Can you give me some advice on this?

/* Example: Using Thrust to convert from 8-bit to double using stream */
struct convert_byte_to_double : public thrust::unary_function<char, double>
{
    __host__ __device__
    double operator()(const char& byte_value) {
        return (double) byte_value;
    }
};

thrust::device_ptr<double> double_devptr =
    thrust::device_pointer_cast(&double_dev[0]);
thrust::device_ptr<char> byte_devptr =
    thrust::device_pointer_cast(&byte_dev[0]);

thrust::transform(thrust::cuda::par.on(*stream), byte_devptr,
    byte_devptr + length, double_devptr, convert_byte_to_double());

In the profiler I will see the following kernel name associated with the default stream:

"void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=0, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=1>, unsigned long=0>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::for_each_n_detail::for_each_kernel, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=0>, thrust::zip_iterator<thrust::tuple<thrust::device_ptr<char>, thrust::device_ptr<double>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::wrapped_function<thrust::detail::unary_transform_functor<convert_byte_to_real>, void>, unsigned int, thrust::null_type, thrust::null_type, thrust::nul"

Thanks!

anon95180265 · May 31, 2015, 10:01pm

Hi Omar, that's not quite a complete sample, so I wouldn't be able to repro. How is "stream" defined in your program?

anon84996908 · June 1, 2015, 10:50am

Hi Mark, the CUDA streams are run in separate OpenMP threads. To create the CUDA streams I use the following:

cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);

I wrote a complete example below. Thank you for your time.

#include <thrust/transform.h>
#include <thrust/reduce.h> /* thrust::reduce */
#include <thrust/functional.h> /* thrust::maximum, thrust::unary_function */
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <omp.h>
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */

#include <stdio.h>
#include <iostream> /* std::cout */
#include <cfloat> /* DBL_MIN, DBL_MAX */

void testThrustStreams();

int main() {
    testThrustStreams();
    return 0;
}

struct convert_byte_to_double: public thrust::unary_function<char, double> {
    __host__ __device__
    double operator()(const char& byte_value) {
        return (double) byte_value;
    }
};

void testThrustStreams() {

    const int size = 10;
    const int num_streams = 5; /* const, so the arrays below are not variable-length */

    char *byte_host[num_streams];
    char *byte_dev[num_streams];
    double *double_dev[num_streams];
    double max_value[num_streams];

    srand(time(NULL));

    /* Fill one host buffer per stream and copy it to the device */
    for (int i = 0; i < num_streams; i++) {
        byte_host[i] = (char *) malloc(sizeof(char) * size);
        for (int j = 0; j < size; j++) {
            byte_host[i][j] = rand() % 255;
        }
        cudaMalloc(&byte_dev[i], size * sizeof(char));
        cudaMemcpy(byte_dev[i], byte_host[i], size * sizeof(char),
                cudaMemcpyHostToDevice);
    }

    /* CUDA streams and output buffers */
    cudaStream_t stream[num_streams];
    for (int i = 0; i < num_streams; i++) {
        cudaMalloc(&double_dev[i], size * sizeof(double));
        cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);
    }

    /* One OpenMP thread per stream */
    #pragma omp parallel num_threads(num_streams)
    {

        int tid = omp_get_thread_num();

        thrust::device_ptr<char> byte_devptr = thrust::device_pointer_cast(
                &byte_dev[tid][0]);
        thrust::device_ptr<double> double_devptr = thrust::device_pointer_cast(
                &double_dev[tid][0]);

        thrust::transform(thrust::cuda::par.on(stream[tid]), byte_devptr,
                byte_devptr + size, double_devptr, convert_byte_to_double());

        /* Start from -DBL_MAX: DBL_MIN is the smallest positive double, and
           char values above 127 wrap to negative where char is signed. */
        max_value[tid] = thrust::reduce(thrust::cuda::par.on(stream[tid]),
                double_devptr, double_devptr + size, -DBL_MAX,
                thrust::maximum<double>());

    }
    /* The parallel region ends with an implicit barrier, so an explicit
       "#pragma omp barrier" is not needed here. */

    for (int i = 0; i < num_streams; i++) {
        std::cout << i << " => max: " << max_value[i] << std::endl;
    }

    /** Cleanup memory and streams **/
    for (int i = 0; i < num_streams; i++) {
        free(byte_host[i]);
        cudaFree(byte_dev[i]);
        cudaFree(double_dev[i]);
        cudaStreamDestroy(stream[i]);
    }

}
