Computing small FFTs with CUDA

jack · August 31, 2008, 2:50pm

I’m fairly new to CUDA development, so forgive me if this is an obvious question…

I’ve been looking at the benchmarks for computing FFTs on the CPU (using FFTW) and on the GPU (with CUDA); it seems that for small FFTs, there’s really no advantage to using CUDA over just computing on the CPU (due to transaction overhead and such).

However, I’ve been taking a look at the MusicBrainz Picard program today, which uses the MusicDNS service; if you have a music collection, this program goes through and reads part of each file, computes an FFT, does some linear algebra with it, then computes a unique value for that song that it can look up online to find the information for that song (Artist, Title, Track #, etc.).

I’ve got a moderately fast computer (Core2Duo 2.4Ghz, 4GB RAM, 8800GT), and it’s taking forever to tag my collection, presumably due to the FFT and Linear Algebra parts of the algorithm. The FFT computation uses (I think) 8192 elements, which really doesn’t get much of a speed-up with CUDA; is there a way to compute a bunch of smaller FFTs in parallel using CUDA? If so, maybe this is a neat idea for the next coding challenge!

MusicBrainz:
http://musicbrainz.org

Description of the Algorithm:
[url=“Google”]Google

Source Code for libofa, which does the actual computing work for the MusicBrainz software:
[url=“Google Code Archive - Long-term storage for Google Code Project Hosting.”]Google Code Archive - Long-term storage for Google Code Project Hosting.

_Big_Mac · August 31, 2008, 3:13pm

You can do batch FFTs using CUDA (with 1D transforms at least).

Your average 3:30 minutes long tune will have about 9 million frames/samples (with 44.1 kHz sampling rate). With no overlapping, this gives about a thousand 8192 point FFTs to calculate. Double that if you’re doing stereo.

Doing this in CUDA will be much faster than with FFFTW if you use the batch FFT feature in CUFFT. The batch feature allows to calculate multiple small FFTs in parallel - exactly what you’re looking for.

jack · August 31, 2008, 4:04pm

Cool…I’m going to give it a try to see if I can make it work.

Looking at the CUFFT documentation…if I want to make a call to cufftExecR2C(), the data stored in GPU memory should be a 1-D array with all of the data points for the batch, correct? I.e.:

|| Data for FFT 1 || Data for FFT 2 || Data for FFT 3 || …

Also, how can I determine the number of FFTs to call in the batch? Is that limited only by the amount of GPU memory?

_Big_Mac · August 31, 2008, 8:56pm

Yes, the data should be aligned continuously just as you described.

I remember reading that it is hard to predict how much space will a given FFT call need because there’s some stochastic guesswork done by the library (can anyone confirm this?).

There has been a CUFFT benchmark posted on the forum. I’ve just modified the code to use batch=1024 and it works fine (does about 6300 FFTs per second, compared to about 2000 with batch=1). This is with an 8800GTS 512. It starts to crash with batch in the range of a couple thousands (presumably out of memory).

For safety, I’d limit the batch to about 128. I get around 6200 FFTs/s in this configuration so there’s not much benefit in terms of saturating the GPU with more concurrent FFTs. Bigger batches seem to get scheduled and executed sequentially anyway and the memory requirements make the whole thing unstable at batch>2048.

If you want a really nice, solid implementation you should include a check for the users’ GPU memory and maybe limit the batch# if he’s running a 8500 or 8400.

BTW don’t get the impression that batch size must be a power of two. There’s no such requirement, I just like powers of two ;)

jack · September 1, 2008, 4:26pm

I’m working on the code today…but I just had a thought about what you wrote…

If I run a batch of 128 FFTs, with sample size=8192…using single precision floats, that equals exactly 2MB of device memory.

I imagine that using larger batches causes the device to simply run the batches in blocks, where the block size is the number of stream processors on your card. The 8800GTS has 128 stream processors, which is why you won’t see any improvement above that number (per batch).

I think what I’m going to do for my application is check the number of stream processors and memory capacity in the device, and determine the amount to run there. For smaller cards (like the 8400 series), the batches will just have to be broken into smaller bits before being sent to the card for processing.

Topic		Replies	Views
what's cufft batch? CUDA Programming and Performance	3	13611	May 26, 2008
FFT algorithm implementation CUDA Programming and Performance	5	1552	December 31, 2009
Short 1D complex FFT performance [2046-32768] CUDA Programming and Performance	4	5190	May 11, 2007
cuFFT doubt. GPU-Accelerated Libraries	1	836	January 18, 2015
(Uncodumented?) limit of fftsize for 1d batch ffts Problem w/CUFFT docs and w/cufftPlan1d CUDA Programming and Performance	0	2658	October 19, 2008
Parralell FFT? CUDA Programming and Performance	5	2440	May 27, 2009
many small fft's CUDA Programming and Performance	2	9204	June 5, 2007
CUFFT (and kernel) questions CUDA Programming and Performance	1	2258	August 14, 2009
cufft 2.3 batched 1D fft of size 80 on 1GB relative low performance of batched cufft of size 80 on 1 CUDA Programming and Performance	0	2110	August 17, 2009
Limit of CUFFT 1D with Batch CUDA Programming and Performance	0	1433	July 20, 2010

Computing small FFTs with CUDA

Related topics