Multiple CUFFT in different streams?

Hi,

I have an application where I need to perform ~100 FFTs.

As far as I can see, the cuFFT documentation does not specify how to calculate FFTs in parallel using different streams. Do you know if it’s possible and how it’s done?

Thanks, Thomas

You cannot have more than one kernel running at the same time; streams don’t change that.

Does that mean one would have to come up with a single kernel that performs the FFTs in parallel?

Will the restriction that no more than 1 kernel can be running at any given time eventually be dropped? What if two distinct applications access CUDA at the same time? Will there be some multitasking between kernels or are kernels executed one after the other?

From the CUDA introduction, section 4.5.1.5:

"Applications manage concurrency through streams. A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently."

Doesn’t this mean that different streams run concurrently? Since different streams run different kernels I read this as being able to start multiple kernels which can run at the same time.

What am I missing?

Thomas

may != will. It may happen on future hardware, but it will not happen on current hardware. Right now kernels are executed one after another. Within a stream the order is as you define it; between streams (or CUDA contexts) the order of execution is undefined (my guess is that it is currently a FIFO queue).

Edit: I say hardware, but for all I know it might be possible to make it happen in software on current hardware. I have no idea.
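For reference, here is a minimal sketch of how an FFT can be issued into a specific stream. It assumes a cuFFT release that provides cufftSetStream, which the original question suggests was not documented at the time. The helper name fft_per_stream and the one-buffer-per-stream layout are made up for illustration. Note that this only controls ordering: work within one stream runs in order, and whether work in different streams actually overlaps is up to the hardware and driver, as discussed above.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: issues one in-place forward C2C FFT of n points per
// device buffer, each in its own stream. Returns true if every call succeeded.
// Streams only define ordering; overlap between streams is not guaranteed.
bool fft_per_stream(int n, const std::vector<cufftComplex*>& dev_signals) {
    std::vector<cudaStream_t> streams;
    std::vector<cufftHandle> plans;
    bool ok = true;

    for (size_t i = 0; i < dev_signals.size() && ok; ++i) {
        cudaStream_t s;
        if (cudaStreamCreate(&s) != cudaSuccess) { ok = false; break; }
        streams.push_back(s);

        cufftHandle p;
        if (cufftPlan1d(&p, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) { ok = false; break; }
        plans.push_back(p);

        // Tie the plan to the stream, then enqueue the transform in that stream.
        ok = cufftSetStream(p, s) == CUFFT_SUCCESS &&
             cufftExecC2C(p, dev_signals[i], dev_signals[i], CUFFT_FORWARD) == CUFFT_SUCCESS;
    }

    // Wait for all enqueued work before releasing the plans and streams.
    if (cudaDeviceSynchronize() != cudaSuccess) ok = false;
    for (cufftHandle p : plans) cufftDestroy(p);
    for (cudaStream_t s : streams) cudaStreamDestroy(s);
    return ok;
}
```

Each pointer in dev_signals is expected to be a device allocation holding at least n cufftComplex values.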

Kernels in different streams executing out of order is not the same as them executing concurrently. In fact, the implication that there is an “order” to be “out of” indicates that kernels are executed one after another (otherwise, how could you assign an ordering?).

What is it you are trying to do with parallel FFTs? Can you use the batch feature?

Yeah, if you have 100 arrays on which you want to calculate FFTs, merge them into a single big array and run a single FFT execution on it with batch = 100. IIRC you will then get a single output array that you can break down into 100 smaller output arrays.
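The batch approach described above might look roughly like the sketch below. It assumes all signals have the same length (a batched plan needs that) and that they are single-precision complex. The helper name batched_fft and the in-place, back-to-back layout are choices made for illustration, not something from the thread. It uses the batch argument of cufftPlan1d; cufftPlanMany is the more general route for other layouts.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: transforms `batch` signals of `n` complex points each,
// stored back to back in `host` (signal k occupies host[k*n] .. host[k*n+n-1]).
// The result overwrites `host` in the same layout. Returns true on success.
bool batched_fft(std::vector<cufftComplex>& host, int n, int batch) {
    if (n <= 0 || batch <= 0 ||
        host.size() != static_cast<size_t>(n) * static_cast<size_t>(batch)) {
        return false;
    }
    const size_t bytes = host.size() * sizeof(cufftComplex);

    cufftComplex* dev = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&dev), bytes) != cudaSuccess) return false;

    cufftHandle plan;
    bool have_plan = false;
    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // One plan covers all `batch` transforms of length `n`.
        have_plan = cufftPlan1d(&plan, n, CUFFT_C2C, batch) == CUFFT_SUCCESS;
        ok = have_plan;
    }
    // A single execution call runs every FFT in the batch, in place.
    if (ok) ok = cufftExecC2C(plan, dev, dev, CUFFT_FORWARD) == CUFFT_SUCCESS;
    // The blocking copy back also waits for the transform to finish.
    if (ok) ok = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (have_plan) cufftDestroy(plan);
    cudaFree(dev);
    return ok;
}
```

After a successful call, the output of FFT number k starts at host[k * n], so splitting the big array back into 100 results is just a matter of indexing.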