Cross-Correlation with CUFFT

Hi everyone,

First things first: I want you to know that I'm quite new to CUDA.

I'm developing in C/C++ and doing some tests with CUDA, especially with cuFFT.
I have several questions and I hope you’ll be able to help me.

  • I saw that cuFFT functions (cufftExecC2C, etc.) can't be called from the device. Can someone confirm this? And is there any FFT function that can be called from the device?

  • I am working on a cross-correlation function. I divide the two pictures (on which I will compute the cross-correlation) into small sub-pictures.
    Then, on each sub-picture, I compute a convolution (FFT -> multiplication -> inverse FFT).
    Unfortunately the sub-pictures are small (32*32). In the few tests I have run, the convolution on the GPU is slower than on the CPU, which is understandable given the
    image size (but maybe I'm wrong and it's a problem with my code). So my question is: at this picture size, could the convolution be faster on the GPU than on the CPU?

Thanks in advance for your help.

PS: Sorry for my poor English, I hope you'll understand my problems.

You can't call cuFFT functions from the device: they are host functions that launch device kernels, so you call them from your CPU application. The alternative is to take FFT source code and integrate it into your own kernel.

If you have 32x32-pixel sub-images, you could probably process a lot of them in parallel to use CUDA efficiently. The best approach is probably to send the whole images to the device, then use the GPU to split each image into sub-pictures and process them all in parallel.
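As a sketch of that batched idea (assuming the 32x32 sub-images are packed contiguously in device memory; names are illustrative and error checking is omitted), cuFFT can transform all of them with a single plan instead of one cufftExecC2C call per sub-image:

```cpp
// Sketch: one batched 2D FFT over many 32x32 sub-images.
// Assumes d_data holds numSubImages sub-images back to back.
#include <cufft.h>
#include <cuda_runtime.h>

void fft_all_subimages(cufftComplex *d_data, int numSubImages)
{
    int n[2] = {32, 32};               // dimensions of each sub-image
    cufftHandle plan;

    // One plan, numSubImages transforms: far cheaper than looping
    // over a separate plan/exec for every 32x32 tile.
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, 32 * 32,    // input: contiguous sub-images
                  NULL, 1, 32 * 32,    // output: same layout, in place
                  CUFFT_C2C, numSubImages);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```

The batch keeps the GPU busy: one 32x32 FFT cannot fill the device, but thousands of them in one call can.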

PS: Another thought: remember that you can launch kernels on the GPU and continue CPU processing while the GPU works on your image, so you use both instead of waiting for the GPU to finish its work.
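That overlap might look like this (a sketch; myKernel, do_cpu_work, and the sizes are placeholders, and the host buffer would need to be allocated with cudaMallocHost for the copy to truly overlap):

```cpp
// Sketch: launch GPU work asynchronously, keep the CPU busy meanwhile.
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;      // placeholder GPU work
}

void do_cpu_work(void);              // whatever the CPU should do meanwhile

void overlap_example(float *h_in, float *d_in, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both calls return immediately on the host.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, n);

    do_cpu_work();                   // runs while the GPU is busy

    cudaStreamSynchronize(stream);   // block only when the result is needed
    cudaStreamDestroy(stream);
}
```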

Thanks for the quick answer.

That's what I was trying to do, but I ran into the "cufftExecC2C problem" inside my kernel!

So the only solution is to take the source code and integrate it into your own kernel?

Yes that’s it.

Anyway, you should avoid ping-ponging between CPU code and GPU kernels; it is often much faster to have the GPU do all the hard work, even if that includes parts that would run faster on the CPU.

It is true that you can't call the cuFFT routines directly from a device function or a kernel.

And if you split the image into small sub-images, send each sub-image to the device individually, FFT-multiply-IFFT it, and copy it back, it will definitely be slower than processing them all on the CPU.

But you don't have to implement the FFT yourself (or get source code from elsewhere) and build one monolithic device routine for the whole job: instead, you can write separate kernels for each step.

For example, you can have one kernel for dividing the image into small ones, another kernel for the multiplication part of the convolution, and another kernel (if required) to merge the results. Then run the first kernel, call the FFT, run the second kernel, call the inverse FFT, and do the rest of your job.
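The multiplication kernel in that pipeline is just an element-wise complex product of the two spectra. The per-element logic (shown here as plain C for clarity; on the GPU each CUDA thread would compute one element, and the function name is illustrative) is:

```c
#include <complex.h>

/* Pointwise product of two spectra, the middle step of FFT-based
   convolution: c[i] = a[i] * b[i].
   For cross-correlation rather than convolution, conjugate one
   spectrum instead: c[i] = conjf(a[i]) * b[i]. */
void spectrum_multiply(const float complex *a, const float complex *b,
                       float complex *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```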

The point is minimizing the amount of data transferred between the host and the device, and the number of transfers. Due to the data-parallel nature of the tasks of dividing and/or merging of images, division or merging of the large image will take only a tiny fraction of the total running time.

Even if you follow this kind of approach, you might not get enough speed-up if the original image is not large enough. In that case, you can transfer many images at once to reduce the ratio of transfer time to computation time.

What exactly is the point of dividing the picture into small parts? Computing one cross-correlation on the whole image is not the same as computing many cross-correlations on parts of the image, unless the small parts overlap – which, of course, slows the computation.

Actually one large FFT can be much, MUCH slower than many overlapping smaller FFTs. This is the driving principle for fast convolution.

As a rule of thumb, the size of the FFT used should be about 4 times larger in each dimension than the convolution kernel. The FFT blocks must overlap in each dimension by the kernel dimension size minus 1. I wrote a paper on the subject a while back. It doesn't really address multi-dimensional convolution, but the principles are roughly the same.
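To make that rule of thumb concrete, here is a small sketch (plain C, one dimension, with illustrative sizes) of the overlap-save bookkeeping it implies:

```c
/* Overlap-save bookkeeping for FFT convolution, per dimension.
   With a kernel of size K and FFT blocks of size N (roughly 4*K
   per the rule of thumb), adjacent blocks must overlap by K-1
   samples, so each block contributes only N-(K-1) new outputs. */
int num_blocks(int n, int fftSize, int kernelSize)
{
    int overlap = kernelSize - 1;           /* required block overlap */
    int validPerBlock = fftSize - overlap;  /* new outputs per block  */
    return (n + validPerBlock - 1) / validPerBlock;  /* ceiling division */
}
```

For example, a length-100000 signal convolved with a length-32 kernel using 128-point FFTs yields 97 valid samples per block, so many overlapping small FFTs cover it instead of one enormous one.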