Proper Use of Streams with Threads

Hi, I have several kernels that run in sequence with cuFFT calls between them; for example, it would do kernel1 -> cuFFT -> kernel2 -> cuFFT -> kernel3. Since I cannot launch cuFFT from the device, this is the only way I'm aware of to do this. Since neither cuFFT nor any of my kernels uses up all of the resources on the device, I want to launch more of these sequences across multiple threads so that the kernel/cuFFT calls can operate concurrently on their own sets of data.

Is the correct way to do this to use threads, each with its own thread ID that then acts as my stream ID? Is OpenMP the recommended way to do this, or should I just use pthreads?

I’m guessing my code would look something like this:

func(cufftHandle plan, cudaStream_t stream, cufftComplex *data) {
    // All work is issued into this thread's stream. Operations in the
    // same stream execute in order, so the intermediate synchronizes are
    // only needed if the host has to wait between steps.
    kernel1<<<grid, block, 0, stream>>>(...);
    cudaStreamSynchronize(stream);
    cufftSetStream(plan, stream);                   // bind the plan to this stream
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // direction as appropriate
    cudaStreamSynchronize(stream);
    kernel2<<<grid, block, 0, stream>>>(...);
    cudaStreamSynchronize(stream);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaStreamSynchronize(stream);
    kernel3<<<grid, block, 0, stream>>>(...);
    cudaStreamSynchronize(stream);
}

Each of my threads can run independently, calling func with its own stream. Is there an easy way to know when I've run out of resources (threads/SMs), or will the scheduler just queue up any pending requests rather than returning an error if it can't service the kernel call immediately?
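For illustration, the per-thread driver might look something like this with OpenMP (just a sketch; NSTREAMS, run_all, dev_data, and fft_size are placeholder names, not my actual code):

#include <cuda_runtime.h>
#include <cufft.h>
#include <omp.h>

#define NSTREAMS 4   // placeholder: number of concurrent sequences

void run_all(cufftComplex *dev_data[], int fft_size) {
    #pragma omp parallel num_threads(NSTREAMS)
    {
        int tid = omp_get_thread_num();   // thread ID doubles as stream index

        // Each thread owns a private stream and a private cuFFT plan.
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cufftHandle plan;
        cufftPlan1d(&plan, fft_size, CUFFT_C2C, 1);

        func(plan, stream, dev_data[tid]);   // the sequence shown above

        cufftDestroy(plan);
        cudaStreamDestroy(stream);
    }
}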

cuFFT has a specific way of being used from different streams; I think you need a different plan for each stream. The only way to see whether you are using the device at its maximum is through the NVIDIA profiler. When resources are available, you will see (assuming you programmed it well) different kernels (or functions) running in parallel.
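Concretely, the stream association is per plan, via cufftSetStream (a minimal sketch; N, stream1/stream2, and the d_a/d_b buffers are placeholders):

cufftHandle plan1, plan2;
cufftPlan1d(&plan1, N, CUFFT_C2C, 1);
cufftPlan1d(&plan2, N, CUFFT_C2C, 1);

// Each plan is bound to its own stream; subsequent executions of that
// plan are issued asynchronously into that stream.
cufftSetStream(plan1, stream1);
cufftSetStream(plan2, stream2);
cufftExecC2C(plan1, d_a, d_a, CUFFT_FORWARD);  // runs in stream1
cufftExecC2C(plan2, d_b, d_b, CUFFT_FORWARD);  // runs in stream2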

Thanks, I actually have that in my code already, so I think it's doing that correctly. I can see the streams overlapping in the profiler, so I believe Hyper-Q is working. I have a slightly different question, though: I have a function just like the one above that takes a separate stream and executes that code. Each of my custom kernels launches with 14 blocks (one per SM on my Titan) and 1024 threads per block. My hope was that, since there would be instructions in the pipeline, the scheduler would switch between the kernels so that all 4 are effectively running at once. However, I'm seeing lots of gaps in the profiler where only a couple of them are running instead of all 4. The profiler picture is shown below. Could the cause be that the memory bandwidth is maxing out? The theoretical bandwidth is supposed to be 288 GB/s, and I'm achieving about 218 GB/s with fully coalesced accesses. Is there a way to tell exactly why multiple kernels aren't running?

[Profiler timeline screenshot]
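In case it's useful, a bandwidth figure like the 218 GB/s above can be estimated by timing with CUDA events and dividing total bytes moved by elapsed time (a sketch; total_bytes is specific to what the kernels actually read and write):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);          // record in the same stream as the work
func(plan, stream, data);                // the kernel/FFT sequence being measured
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);              // block until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
double bytes_moved = total_bytes;        // placeholder: bytes read + written by the work
double gbps = bytes_moved / (ms * 1e-3) / 1e9;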