GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Could it be this issue? https://github.com/thrust/t... It's fixed in the Thrust trunk, but not in the version included in CUDA 7. You could try pulling the latest from GitHub to compare.

Indeed that fixed it. Thanks!

How can I enable this feature in a Visual Studio 2012 solution using CUDA Runtime 7.0? I define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before I include the CUDA headers in my kernel.cu file, but it does not work. I want to know why.

I don't have a Windows machine in front of me, but I know that Visual Studio lets you add custom command-line flags. So you could set -default-stream-per-thread on the command line. But the env variable should work too -- without seeing your code/project, I can't tell for sure why it's not working for you. Did you set it in one of the files, or in the project settings? I would set it in the environment box of the CUDA C++ project settings dialog.

I set it in the project: Project -> Properties -> CUDA C/C++ -> Host -> Preprocessor Definitions -> CUDA_API_PER_THREAD_DEFAULT_STREAM

The code is the same as yours, and it is in a .cu file.

I wonder whether this method works on Windows. I have the same question as Wei: I can't enable this feature in VS2010.

It should work -- I realized there was a typo in one spot where I had `--default-stream-per-thread` when it should be `--default-stream per-thread` (note space). Can you make sure you set the flag correctly?
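For readers who cannot get the IDE to pass the flag through, one alternative is to name the per-thread stream explicitly with the cudaStreamPerThread handle (new in CUDA 7, like cudaStreamLegacy), which should behave the same whether or not the flag reaches nvcc. The following is only a minimal sketch: the kernel `fill`, the helpers `worker` and `run_workers`, and the use of std::thread (which may need -std=c++11 on older nvcc versions on Linux) are assumptions made for illustration, not code from the post.

```cuda
// Possible build line: nvcc -std=c++11 ptds_explicit.cu -o ptds_explicit
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes one value into every element.
__global__ void fill(float *x, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = value;
}

// Hypothetical worker: each CPU thread issues its work on cudaStreamPerThread,
// so the threads do not serialize on the legacy default stream.
void worker(int id, int n, char *ok) {
    *ok = 0;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return;

    fill<<<(n + 255) / 256, 256, 0, cudaStreamPerThread>>>(d, n, (float)id);
    cudaError_t err = cudaGetLastError();  // catches launch failures

    float first = -1.0f;
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(&first, d, sizeof(float),
                              cudaMemcpyDeviceToHost, cudaStreamPerThread);
    if (err == cudaSuccess)
        err = cudaStreamSynchronize(cudaStreamPerThread);  // waits for this thread's work only

    cudaFree(d);
    *ok = (err == cudaSuccess && first == (float)id) ? 1 : 0;
}

// Launches numThreads workers and reports whether all of them succeeded.
bool run_workers(int numThreads) {
    std::vector<char> ok(numThreads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back(worker, i, 1 << 20, &ok[i]);
    for (auto &t : threads) t.join();
    for (char c : ok)
        if (!c) return false;
    return true;
}

int main() {
    printf("all workers succeeded: %s\n", run_workers(4) ? "yes" : "no");
    cudaDeviceReset();  // flush profiling data before exit
    return 0;
}
```

Whether the kernels actually overlap still has to be confirmed in the profiler timeline; the sketch only checks that each thread's work completes on its own stream.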

Hello, I'm not sure if this post is the right place for this kind of question; if not, please excuse me. I am compiling Caffe with cuDNN for use with DIGITS. I got this error:

src/caffe/layers/cudnn_conv_layer.cu(56): error: identifier "cudaStreamLegacy" is undefined

src/caffe/layers/cudnn_conv_layer.cu(137): error: identifier "cudaStreamLegacy" is undefined

Could you please give any advice about it?

Thanks a lot.

Make sure you have CUDA 7.0 or later -- cudaStreamLegacy is new in CUDA 7.

I installed these two packages:

cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb
cudnn-7.0-linux-x64-v3.0-prod.tgz
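Since CUDA 7.5 should already define cudaStreamLegacy, one thing worth ruling out is that the build is picking up headers from an older toolkit. A small sketch of such a check follows; the helper name `compiled_cuda_version` is made up for illustration, while CUDART_VERSION and cudaRuntimeGetVersion() come from the CUDA runtime headers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDART_VERSION is 7000 for CUDA 7.0, the first release with cudaStreamLegacy.
#if CUDART_VERSION < 7000
#error "These CUDA headers predate 7.0; cudaStreamLegacy is not available."
#endif

// Hypothetical helper: version of the headers this file was compiled against.
int compiled_cuda_version() { return CUDART_VERSION; }

int main() {
    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);  // version of the runtime library in use
    printf("headers: %d, runtime: %d\n", compiled_cuda_version(), runtimeVersion);
    return 0;
}
```

If the #error fires when this is compiled with the same nvcc and include paths as the Caffe build, an older toolkit is being found first.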

Hi Mark,

I am trying an SpMV, and so far it works fine with matrices that fit into GPU memory. I am about to try a matrix that is larger than my GPU's 4 GB of memory.
Can streams be used in that scenario? At the moment I pass the CSR arrays to the GPU, and once the GPU has them it performs the SpMV. I'm not sure how that would be possible with streams.

Hi Mark,

I'm a little late to this, but does this apply to Thrust? That is, with CUDA 7, does running Thrust algorithms on different threads mean those algorithms run on their per-thread default stream?

EDIT: Never mind, I see Omar's question below.

My CUDA version is 7.5, I am using Visual Studio 2013, and the device is a GTX 850M, but when I run the program, nvvp shows no timeline. What can be the problem?

And when I run nvprof, it says:

==8128== NVPROF is profiling process 8128, command: CudaTest.exe
==8128== Profiling application: CudaTest.exe
==8128== Profiling result:
No kernels were profiled.

==8128== API calls:
No API activities were profiled.

I tested the first code example on my PC. On Linux it works fine, but on Windows with VS2013 the streams don't run concurrently, even with the "default-stream per-thread" flag. I also compiled the code on the command line with "nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread". It doesn't run concurrently that way either.

Dear Mr. Harris,

I have tried out your instructions above and came up with an idea like this:

void threadExecute(void *input_data, int nx)
{
    cufftComplex *data = (cufftComplex*)input_data;
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, 1);
    cufftSetStream(plan, this_stream);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cufftDestroy(plan);
    cudaStreamSynchronize(0);
}

As I understand it, each CPU thread will be given its own stream on the GPU. Is that correct?

If it is correct, how can I get the stream that is assigned to each CPU thread so that I can pass it to cufftSetStream()?

If it is not correct, how can I use the cuFFT API with multiple CPU threads and multiple streams?

Could you please help me with this?

I would appreciate it very much.

See my answer above regarding the NPP library, which is similar. If you follow the instructions in the post and compile your code to use "-default-stream per-thread", you should be able to pass cudaStreamPerThread to cufftSetStream() so that it uses the default stream in each thread. Does this work?
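As a rough sketch of that suggestion, the threadExecute() from the question could be adapted as below. The bool return value, the error checks, the moved synchronize call, and the two-thread driver in main() are additions made for illustration; the program needs to be linked with cuFFT (for example with -lcufft), and std::thread may need -std=c++11 on older nvcc versions.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>
#include <cufft.h>

// Adapted from the question: data is a device pointer holding nx complex values.
bool threadExecute(cufftComplex *data, int nx) {
    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, 1) != CUFFT_SUCCESS) return false;

    // Pass the per-thread default stream handle instead of an undefined this_stream.
    bool ok = cufftSetStream(plan, cudaStreamPerThread) == CUFFT_SUCCESS
           && cufftExecC2C(plan, data, data, CUFFT_FORWARD) == CUFFT_SUCCESS
           // Wait on the same stream the FFT ran on, before the plan goes away.
           && cudaStreamSynchronize(cudaStreamPerThread) == cudaSuccess;

    cufftDestroy(plan);
    return ok;
}

int main() {
    const int nx = 1024;
    cufftComplex *a = nullptr, *b = nullptr;
    cudaMalloc(&a, nx * sizeof(cufftComplex));
    cudaMalloc(&b, nx * sizeof(cufftComplex));
    cudaMemset(a, 0, nx * sizeof(cufftComplex));
    cudaMemset(b, 0, nx * sizeof(cufftComplex));
    cudaDeviceSynchronize();  // make sure the buffers are ready before the threads start

    bool okA = false, okB = false;
    std::thread t1([&] { okA = threadExecute(a, nx); });
    std::thread t2([&] { okB = threadExecute(b, nx); });
    t1.join();
    t2.join();

    printf("thread 1: %s, thread 2: %s\n", okA ? "ok" : "failed", okB ? "ok" : "failed");
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Creating and destroying a plan on every call mirrors the question; in a real application each thread would more likely build its plan once and reuse it.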

Hello Mark,
I understand that, as you mentioned, enabling PTDS for your compilation units doesn't enable it for libraries that are separately compiled.
But I wish to enable PTDS for the Thrust library.
How do I call thrustSetStream()?
Thank you.

To set a stream for a Thrust algorithm you need to use the .on() method on the cuda::par execution policy, like so:

thrust::sort(thrust::cuda::par.on(stream), begin, end, comparator);
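A fuller sketch of that call is shown below. The helper `sort_on_stream`, the sample data, and the choice of thrust::less<int> as the comparator are assumptions for illustration. Since cudaStreamPerThread is also a cudaStream_t, it could presumably be passed to .on() in the same way to target the calling thread's default stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical helper: sorts four integers on a user-created stream.
bool sort_on_stream() {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    bool sorted = false;
    {
        int init[] = {3, 1, 4, 2};
        thrust::device_vector<int> v(init, init + 4);

        // The execution policy carries the stream; the algorithm's work is issued on it.
        thrust::sort(thrust::cuda::par.on(stream), v.begin(), v.end(), thrust::less<int>());
        cudaStreamSynchronize(stream);

        thrust::host_vector<int> h = v;  // copy back to inspect the result
        sorted = h[0] == 1 && h[1] == 2 && h[2] == 3 && h[3] == 4;
    }  // device_vector is freed here, before the stream is destroyed

    cudaStreamDestroy(stream);
    return sorted;
}

int main() {
    printf("sorted on stream: %s\n", sort_on_stream() ? "yes" : "no");
    return 0;
}
```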

Hi Mark,
The example above works perfectly on my Ubuntu system as well.
I could even reproduce Figure 2 shown above using the command
nvcc --default-stream per-thread

However, when I try to do the same thing in the Nsight editor, I do not see the effect of the --default-stream per-thread flag.
I entered it in the Command box as "nvcc --default-stream per-thread", under Project Properties -> Settings -> Tool Settings, in the NVCC Compiler section.

I doubt this is the correct place to put this flag. I have even tried putting it in Build Stages -> Preprocessor options (-Xcompiler), but that did not work either.

Could you please tell me where I should put this flag in the Nsight editor?

Thanks and Warm Regards
Amit Gurung

Hi Amit,

Try adding the flag in Project Properties -> Settings -> Tool Settings -> NVCC Compiler -> Expert Setting:

${COMMAND} --default-stream per-thread ${FLAGS} ${OUTPUT_FLAG} ${OUTPUT_PREFIX} ${OUTPUT} ${INPUTS}