Computing small FFTs with CUDA

I’m fairly new to CUDA development, so forgive me if this is an obvious question…

I’ve been looking at the benchmarks for computing FFTs on the CPU (using FFTW) and on the GPU (with CUDA); it seems that for small FFTs, there’s really no advantage to using CUDA over just computing on the CPU (due to transaction overhead and such).

However, I’ve been taking a look at the MusicBrainz Picard program today, which uses the MusicDNS service; if you have a music collection, this program goes through and reads part of each file, computes an FFT, does some linear algebra with it, then computes a unique value for that song that it can look up online to find the information for that song (Artist, Title, Track #, etc.).

I’ve got a moderately fast computer (Core2Duo 2.4Ghz, 4GB RAM, 8800GT), and it’s taking forever to tag my collection, presumably due to the FFT and Linear Algebra parts of the algorithm. The FFT computation uses (I think) 8192 elements, which really doesn’t get much of a speed-up with CUDA; is there a way to compute a bunch of smaller FFTs in parallel using CUDA? If so, maybe this is a neat idea for the next coding challenge!


Description of the Algorithm:

Source Code for libofa, which does the actual computing work for the MusicBrainz software:

You can do batch FFTs using CUDA (with 1D transforms at least).

Your average 3:30 minutes long tune will have about 9 million frames/samples (with 44.1 kHz sampling rate). With no overlapping, this gives about a thousand 8192 point FFTs to calculate. Double that if you’re doing stereo.

Doing this in CUDA will be much faster than with FFFTW if you use the batch FFT feature in CUFFT. The batch feature allows to calculate multiple small FFTs in parallel - exactly what you’re looking for.

Cool…I’m going to give it a try to see if I can make it work.

Looking at the CUFFT documentation…if I want to make a call to cufftExecR2C(), the data stored in GPU memory should be a 1-D array with all of the data points for the batch, correct? I.e.:

|| Data for FFT 1 || Data for FFT 2 || Data for FFT 3 || …

Also, how can I determine the number of FFTs to call in the batch? Is that limited only by the amount of GPU memory?

Yes, the data should be aligned continuously just as you described.

I remember reading that it is hard to predict how much space will a given FFT call need because there’s some stochastic guesswork done by the library (can anyone confirm this?).

There has been a CUFFT benchmark posted on the forum. I’ve just modified the code to use batch=1024 and it works fine (does about 6300 FFTs per second, compared to about 2000 with batch=1). This is with an 8800GTS 512. It starts to crash with batch in the range of a couple thousands (presumably out of memory).

For safety, I’d limit the batch to about 128. I get around 6200 FFTs/s in this configuration so there’s not much benefit in terms of saturating the GPU with more concurrent FFTs. Bigger batches seem to get scheduled and executed sequentially anyway and the memory requirements make the whole thing unstable at batch>2048.

If you want a really nice, solid implementation you should include a check for the users’ GPU memory and maybe limit the batch# if he’s running a 8500 or 8400.

BTW don’t get the impression that batch size must be a power of two. There’s no such requirement, I just like powers of two ;)

I’m working on the code today…but I just had a thought about what you wrote…

If I run a batch of 128 FFTs, with sample size=8192…using single precision floats, that equals exactly 2MB of device memory.

I imagine that using larger batches causes the device to simply run the batches in blocks, where the block size is the number of stream processors on your card. The 8800GTS has 128 stream processors, which is why you won’t see any improvement above that number (per batch).

I think what I’m going to do for my application is check the number of stream processors and memory capacity in the device, and determine the amount to run there. For smaller cards (like the 8400 series), the batches will just have to be broken into smaller bits before being sent to the card for processing.