Local variables in kernel

Dear all,

I am a bit unsure about the use of local variables inside a kernel. Does the code below look OK, or could there be a conflict between the local variables used by the different threads, e.g. a race condition? I would appreciate your advice, even if it is just “it is ok”.

Many thanks!

float real1, imag1, real2, imag2;

int id = threadIdx.x + blockDim.x * blockIdx.x;
if (id < nelements)
{
    // load one complex element from each input array
    real1 = d_realtcsf[id];
    imag1 = d_imagtcsf[id];
    real2 = d_realcftab[id];
    imag2 = d_imagcftab[id];
    // complex multiplication: tcsf[id] = tcsf[id] * cftab[id]
    d_realtcsf[id] = real1 * real2 - imag1 * imag2;
    d_imagtcsf[id] = real1 * imag2 + imag1 * real2;
}
__syncthreads();

Here d_realtcsf and d_imagtcsf are in global memory, which is not cached, so each thread needs to access device memory. Without coalescing those accesses are serialised and take about 500 clock cycles each.
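
For illustration, here is a sketch of what coalesced versus uncoalesced access looks like (placeholder kernel and array names, not your arrays):

__global__ void coalescedRead(const float *in, float *out, int n)
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n)
        out[id] = in[id];           // neighbouring threads read neighbouring addresses,
                                    // so the hardware can merge them into a few transactions
}

__global__ void stridedRead(const float *in, float *out, int n, int stride)
{
    // 'in' is assumed to hold at least n * stride elements
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n)
        out[id] = in[id * stride];  // neighbouring threads read addresses far apart,
                                    // so each read can turn into its own transaction
}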

By the way, if(id<nelements) causes branching, which you don’t want. Try to avoid using the thread ID in any flow-control instruction within a warp.

Many thanks! Still learning, so I appreciate any help. I’ve tried omitting the if(id<nelements), but then I get zeroes everywhere.

You seem to indicate that the code could be made even faster, and I would appreciate hearing how this could be done.

I was thinking about using shared memory, but to allocate it inside a kernel the array size needs to be declared as a constant, which I cannot do: the array sizes vary, as they are taken from an input file.

Would you, or anyone else, be willing to show with an equivalent code sample how this could be made even faster?

What I have now is already more than 30 times faster than the CPU equivalent.

Many thanks!

There’s really nothing wrong with an if statement like you have, which cuts off threads that extend past the array boundary. (This is quite common when your data size is not divisible by a good block size.) You obviously don’t want a huge number of unused threads, but a warp or two isn’t going to kill your performance, especially since this is memory bandwidth-bound code.

Also, you don’t need any __syncthreads() at the end of this code. That is only used to avoid shared memory race conditions when you have threads reading shared memory locations after other threads write to them. (And that’s basically it. It protects you from no other race conditions!)
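
On your shared-memory question: the size does not have to be a compile-time constant. You can declare the buffer extern __shared__ and pass the size in bytes as the third kernel launch parameter, so it can depend on values read from your input file at run time. Here is a minimal sketch (placeholder kernel and variable names, nothing from your code) that also shows the kind of pattern where __syncthreads() really is needed, i.e. threads reading shared-memory slots that other threads wrote:

__global__ void neighbourSum(const float *d_in, float *d_out, int nelements)
{
    extern __shared__ float s_buf[];   // size supplied at launch, not at compile time

    int id  = threadIdx.x + blockDim.x * blockIdx.x;
    int tid = threadIdx.x;

    if (id < nelements)
        s_buf[tid] = d_in[id];         // every thread writes its own slot

    __syncthreads();                   // required: the read below touches a slot
                                       // written by a different thread

    if (id < nelements)
    {
        // block boundaries are ignored here for brevity
        float left = (tid > 0) ? s_buf[tid - 1] : 0.0f;
        d_out[id] = s_buf[tid] + left;
    }
}

// Host side: the third launch parameter is the shared-memory size in bytes.
// neighbourSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, nelements);

That said, in your kernel each element is read exactly once by exactly one thread, so staging the data through shared memory would not by itself make it faster; as said above, this code is limited by global-memory bandwidth.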