Proper Use of Streams with Threads

Hi, I have several kernels that run in sequence with cuFFT calls between them; for example, it would do kernel1 -> cuFFT -> kernel2 -> cuFFT -> kernel3. Since I cannot launch cuFFT from the device, this is the only way I'm aware of to do this. Since neither cuFFT nor any of my kernels uses up all of the resources on the device, I want to launch more of these sequences across multiple threads so that the kernel/cuFFT calls can operate concurrently on their own sets of data.

Is the correct way to do this to use threads, each with its own thread ID that then acts as my stream ID? Is OpenMP the recommended way to do this, or should I just use pthreads?

I’m guessing my code would look something like this:

func(cufftHandle plan, cudaStream_t stream, cufftComplex *data) {
    // All work is issued into this thread's stream. Operations in the
    // same stream execute in order, so the intermediate synchronizes are
    // only needed if the host has to wait between steps.
    kernel1<<<grid, block, 0, stream>>>(...);
    cudaStreamSynchronize(stream);
    cufftSetStream(plan, stream);                   // bind the plan to this stream
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // direction as appropriate
    cudaStreamSynchronize(stream);
    kernel2<<<grid, block, 0, stream>>>(...);
    cudaStreamSynchronize(stream);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaStreamSynchronize(stream);
    kernel3<<<grid, block, 0, stream>>>(...);
    cudaStreamSynchronize(stream);
}

Each of my threads can run independently, calling func with its own stream. Is there an easy way to know when I've run out of resources (threads/SMs), or will the scheduler just queue up any pending requests rather than returning an error if it can't service the kernel call immediately?
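For illustration, the per-thread driver might look something like this with OpenMP (just a sketch; NSTREAMS, run_all, dev_data, and fft_size are placeholder names, not my actual code):

#include <cuda_runtime.h>
#include <cufft.h>
#include <omp.h>

#define NSTREAMS 4   // placeholder: number of concurrent sequences

void run_all(cufftComplex *dev_data[], int fft_size) {
    #pragma omp parallel num_threads(NSTREAMS)
    {
        int tid = omp_get_thread_num();   // thread ID doubles as stream index

        // Each thread owns a private stream and a private cuFFT plan.
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cufftHandle plan;
        cufftPlan1d(&plan, fft_size, CUFFT_C2C, 1);

        func(plan, stream, dev_data[tid]);   // the sequence shown above

        cufftDestroy(plan);
        cudaStreamDestroy(stream);
    }
}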

cuFFT has a specific way of being used from different streams; I think you need a different plan for each stream. The only way to see whether you are using the device at its maximum is through the NVIDIA profiler. When resources are available, you will see (assuming you programmed it well) different kernels (or functions) running in parallel.
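Concretely, the stream association is per plan, via cufftSetStream (a minimal sketch; N, stream1/stream2, and the d_a/d_b buffers are placeholders):

cufftHandle plan1, plan2;
cufftPlan1d(&plan1, N, CUFFT_C2C, 1);
cufftPlan1d(&plan2, N, CUFFT_C2C, 1);

// Each plan is bound to its own stream; subsequent executions of that
// plan are issued asynchronously into that stream.
cufftSetStream(plan1, stream1);
cufftSetStream(plan2, stream2);
cufftExecC2C(plan1, d_a, d_a, CUFFT_FORWARD);  // runs in stream1
cufftExecC2C(plan2, d_b, d_b, CUFFT_FORWARD);  // runs in stream2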

Thanks, I actually have that in my code already, so I think it's doing that correctly. I can see the streams overlapping in the profiler, so I believe Hyper-Q is working. I have a slightly different question, though: I have a function just like the one above that takes a separate stream and executes that code. Each of my custom kernels launches with 14 blocks (one per SM on my Titan) and 1024 threads per block. My hope was that, since there would be instructions in the pipeline, the scheduler would switch between the kernels so that all 4 are effectively running at once. However, I'm seeing lots of gaps in the profiler where only a couple of them are running instead of all 4. The profiler picture is shown below. Could the cause be that the memory bandwidth is maxing out? The theoretical bandwidth is supposed to be 288 GB/s, and I'm achieving about 218 GB/s with fully coalesced accesses. Is there a way to tell exactly why multiple kernels aren't running?

[Profiler timeline screenshot]
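In case it's useful, a bandwidth figure like the 218 GB/s above can be estimated by timing with CUDA events and dividing total bytes moved by elapsed time (a sketch; total_bytes is specific to what the kernels actually read and write):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);          // record in the same stream as the work
func(plan, stream, data);                // the kernel/FFT sequence being measured
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);              // block until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
double bytes_moved = total_bytes;        // placeholder: bytes read + written by the work
double gbps = bytes_moved / (ms * 1e-3) / 1e9;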