Hello world!
I’m new to CUDA and I’m porting a DSP application. Hurray for CUDA!
I’m looking at the simpleCUFFT example and I have a question about the complex multiplication step…
First, some context. The purpose of the example is to apply a filter by convolution using the FFT. The input signal and the filter response vectors (arrays, if you wish) are both zero-padded (look up the book “Numerical Recipes in C” to understand how and why the padding is done the way it is…), then a forward FFT is applied to each. The two (complex-valued) vectors are then multiplied point-wise; by the convolution theorem, multiplication in the frequency domain is equivalent to convolution in the time domain. The resulting vector is then inverse-FFT’d and so magically gives you the output of the filter.
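To make sure I have the big picture right, here is a minimal sketch of the pipeline as I understand it (the wrapper function fftConvolve and the in-place transforms are my own assumptions, not the sample’s exact code; Complex is the sample’s typedef for float2, which has the same layout as cufftComplex, so the casts should be safe):

#include <cufft.h>

typedef float2 Complex; // as in the sample

static __global__ void ComplexPointwiseMulAndScale(Complex*, const Complex*,
                                                   int, float);

// Hypothetical wrapper around the steps I believe the sample performs.
void fftConvolve(Complex* d_signal, Complex* d_filter_kernel, int new_size)
{
    cufftHandle plan;
    cufftPlan1d(&plan, new_size, CUFFT_C2C, 1);

    // Forward FFT both padded vectors, in place.
    cufftExecC2C(plan, (cufftComplex*)d_signal,
                 (cufftComplex*)d_signal, CUFFT_FORWARD);
    cufftExecC2C(plan, (cufftComplex*)d_filter_kernel,
                 (cufftComplex*)d_filter_kernel, CUFFT_FORWARD);

    // Point-wise multiply in the frequency domain; the 1/new_size factor
    // compensates for cuFFT’s unnormalized inverse transform.
    ComplexPointwiseMulAndScale<<<32, 256>>>(d_signal, d_filter_kernel,
                                             new_size, 1.0f / new_size);

    // Inverse FFT brings the product back to the time domain: the
    // (circular) convolution of signal and filter.
    cufftExecC2C(plan, (cufftComplex*)d_signal,
                 (cufftComplex*)d_signal, CUFFT_INVERSE);

    cufftDestroy(plan);
}

One thing I like about the design is that the scale factor is folded into the multiply kernel, so the array only has to be read and written once. Please tell me if I’ve misread any of those steps.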
What I am curious about is how they implement the point-wise multiplication with a kernel launch. The code is simple; here it is:
...
// kernel launch
ComplexPointwiseMulAndScale<<<32, 256>>>(d_signal, d_filter_kernel, new_size, 1.0f / new_size);
...
// Complex pointwise multiplication (and scaling)
static __global__ void ComplexPointwiseMulAndScale(Complex* a, const Complex* b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;              // total threads in the grid
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x; // this thread's global index
    // Each thread starts at its global index and strides by the grid size.
    for (int i = threadID; i < size; i += numThreads)
        a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
}
What I am not sure about is how they decided to launch 32 blocks of 256 threads each. These numbers don’t seem to relate to the size of the original input vector, nor to that of the filter response vector. However, it seems to me that the for-loop condition (i < size) protects the threads from accessing non-existent data. Is this true? I’m new to working with threads, so the way the job is distributed among them isn’t entirely clear to me; my current reading of the index mapping is sketched below.
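For concreteness, here is how I believe the work gets divided (a host-side sanity check I wrote myself, not code from the sample): <<<32, 256>>> gives 32 × 256 = 8192 threads in total, and each thread strides through the array in steps of 8192.

#include <stdio.h>

int main(void)
{
    const int numThreads = 32 * 256; // gridDim.x * blockDim.x = 8192
    const int size = 20000;          // e.g. some padded signal length

    int threadID = 0;                // trace a single thread
    for (int i = threadID; i < size; i += numThreads)
        printf("thread %d processes element %d\n", threadID, i);
    // Prints elements 0, 8192, and 16384; any i >= size is skipped,
    // which is why the (i < size) guard looks safe to me.
    return 0;
}

If that reading is right, then any launch configuration produces correct results and the block/thread counts only affect performance, but I’d appreciate confirmation.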