Optimize CUDA code when odd and even threads do different work: how to avoid divergence?

Hi all!

I have two large vectors and am trying to do a kind of element-wise multiplication, where each even-numbered element of the first vector is multiplied by the next odd-numbered element of the second vector, and each odd-numbered element of the first vector is multiplied by the preceding even-numbered element of the second vector.


vector 1 is V1(1) V1(2) V1(3) V1(4)

vector 2 is V2(1) V2(2) V2(3) V2(4)

V1(1) * V2(2)

V1(3) * V2(4)

V1(2) * V2(1)

V1(4) * V2(3)

I have written CUDA code to do this (Pds holds the elements of the first vector in shared memory, Nds the second vector):

    // instead of using % 2, check the lowest bit to decide whether tx is odd or even --> faster
    if ((tx & 0x0001) == 0x0000)
        Nds[tx+1] = Pds[tx] * Nds[tx+1];
    else
        Nds[tx-1] = Pds[tx] * Nds[tx-1];

Is there any way to further accelerate this code or avoid the divergence?


If your code is part of a loop, it’s likely faster if written as

    int offset = 1 - 2 * (tx & 1);
    Nds[tx + offset] *= Pds[tx];


although the compiler might do this transformation on its own already.

However, I do not see how either variant is supposed to work at all due to the race condition created by different threads accessing the same array elements.