Question regarding Complex Multiplication in FFT Convolution

Hello world!

I’m new to CUDA and I’m porting a DSP application to it. Hurray for CUDA!

I’m looking at the simpleCUFFT example and I have a question about the complex multiplication step…

First, the purpose of the example is to apply convolution using the FFT. The input signal and the filter response vectors (arrays, if you wish) are both zero-padded (see the book “Numerical Recipes in C” to understand how and why the padding is done the way it is), then a forward FFT is applied to each. The two (complex-valued) vectors are then multiplied point-wise with each other. The resulting vector is inverse-FFT’d, which magically gives you the output of the filter applied to the signal.

What I am curious about is how they implement the pointwise multiplication as a kernel launch. The code is below.

...

// kernel launch
ComplexPointwiseMulAndScale<<<32, 256>>>(d_signal, d_filter_kernel, new_size, 1.0f / new_size);

...

// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex* a, const Complex* b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread handles elements threadID, threadID + numThreads, ...
    for (int i = threadID; i < size; i += numThreads)
        a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
}

What I am not sure about is how they decided to launch 32 blocks and 256 threads per block. These numbers don’t seem to relate to the size of the original input vector or the filter response vector. However, it seems to me that the for-loop condition (i < size) protects the threads from accessing non-existent data. Is this true? I’m new to working with threads, so the way the job is distributed among them isn’t entirely clear to me…

The structure itself has nothing to do with the FFT.

Say you have 4 threads and 13 elements. Just imagine the 13 elements arranged in 4 columns:

0 1 2 3
4 5 6 7
8 9 10 11
12

Thread x processes only the xth column. After each iteration it moves on to the next element in the same column; that’s why i is incremented by the number of threads (the number of columns). When i reaches the vector size, the thread stops. That’s it. No matter how big the vector is, this structure handles it well.

Cool, so basically it’s just getting a whole bunch of workers and splitting the job between them. Is there a way to optimize the number of blocks and threads relative to the number of MPs in the hardware? How can I pick “good” numbers?