Optimizing vector element swaps using CUDA (the swaps involved are not direct!)

Hi all,

Since I am new to CUDA, I need your kind help. I have a long vector, and for each group of 24 elements I need to do the following:
for the first 12 elements, the even-numbered elements are multiplied by -1;
for the second 12 elements, the odd-numbered elements are multiplied by -1;
then the following swap takes place (image attached; here is a link to it):

Graph swap image
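
To make the pattern concrete, here is a plain C sketch of what should happen to each group of 24 (host-side reference only; the function name and in-place layout are just mine for illustration):

void reference(double *v, int n)   /* n is a multiple of 24 */
{
    for (int base = 0; base < n; base += 24) {
        /* first half: negate even-indexed elements */
        for (int i = 0; i < 12; ++i)
            if ((i & 1) == 0)
                v[base + i] = -v[base + i];
        /* second half: negate odd-indexed elements */
        for (int i = 12; i < 24; ++i)
            if ((i & 1) == 1)
                v[base + i] = -v[base + i];
        /* swap the six-element blocks 0<->3 and 1<->2,
           exchanging adjacent pairs (i <-> i^1) while doing so */
        for (int i = 0; i < 12; ++i) {
            int j = (i ^ 1) + (i < 6 ? 18 : 6);
            double tmp = v[base + i];
            v[base + i] = v[base + j];
            v[base + j] = tmp;
        }
    }
}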

I have written this piece of code and wonder if you could help me optimize it further to address divergence or bank conflicts:
//subVector is a multiple of 24; Mds and Nds are in shared memory

__shared__ double Mds[subVector];
__shared__ double Nds[subVector];

int tx = threadIdx.x;
int tx_mod = tx ^ 0x0001;  // partner index for the pairwise exchange
int basex = __umul24(blockDim.x, blockIdx.x);

Mds[tx] = M.elements[basex + tx];
__syncthreads();

// flip the signs
if (tx < (tx/24)*24 + 12)
{
    // first half of the 24-element group: negate even-indexed elements
    if ((tx & 0x0001) == 0)
        Mds[tx] = -Mds[tx];
}
else if (tx < (tx/24)*24 + 24)
{
    // second half of the 24-element group: negate odd-indexed elements
    if ((tx & 0x0001) == 1)
        Mds[tx] = -Mds[tx];
}

__syncthreads();

if (tx < (tx/24)*24 + 6)
{
    // first 6 elements: swap with the last six in the 24-element group (see graph)
    Nds[tx] = Mds[tx_mod + 18];
    Mds[tx_mod + 18] = Mds[tx];
    Mds[tx] = Nds[tx];
}
else if (tx < (tx/24)*24 + 12)
{
    // second 6 elements: swap with the next adjacent group of six (see graph)
    Nds[tx] = Mds[tx_mod + 6];
    Mds[tx_mod + 6] = Mds[tx];
    Mds[tx] = Nds[tx];
}

__syncthreads();

Thanks in advance …

Sure. You can fold the sign flip and the permutation into the load itself, so each element is read and written only once and the second shared array and the extra synchronizations disappear:

//subVector is a multiple of 24; Mds is in shared memory

	__shared__ double Mds[subVector];

	int tx = threadIdx.x;
	int basex = __umul24(blockDim.x, blockIdx.x);

	// source index: swap the 6-element sub-blocks 0<->3 and 1<->2 within each
	// group of 24, exchanging adjacent pairs (i <-> i^1) within each sub-block
	int permuted_idx = ((tx/6) ^ 3) * 6 + ((tx%6) ^ 1);

	int negate = (tx%24 < 12) ^ (tx & 1);

	Mds[tx] = M.elements[basex + permuted_idx];

	if (negate)
		Mds[tx] = -Mds[tx];

	__syncthreads();

I’d personally be interested to know if this one is slower or faster:

	__shared__ double Mds[subVector];

	int tx = threadIdx.x;
	int basex = __umul24(blockDim.x, blockIdx.x);

	int permuted_idx = ((tx/6) ^ 3) * 6 + ((tx%6) ^ 1);

	double sign = (tx%24 < 12) ^ (tx & 1) ? 1.0 : -1.0;

	Mds[tx] = sign * M.elements[basex + permuted_idx];

	__syncthreads();

I guess it’s faster because it avoids some shared memory accesses. If Mds[tx] were a register, I’d be less sure. It would all depend on whether predicated double operations are still scheduled if the predicate is false.

Hi Tera,

Thanks for the help, I like the simplicity of the code!

Before the timings: in the code it should be double sign = (tx%24 < 12) ^ (tx & 1) ? -1.0 : 1.0; to give correct results.

The code has been tested on a G210 device, using the cutil library timers (from the NVIDIA SDK) to measure time.

As for the timing: I used an input vector of 49152 elements and ran the code 1,000,000 times.

The average execution time of your suggested version is 0.004721 ms, and for the original code posted here it is 0.004757 ms.

I find it quite strange: your version halves the shared memory usage and likewise the number of shared memory accesses, yet no significant difference is observed. The G210 has a relaxed memory coalescing model, so the non-contiguous global memory access should have little effect on the results, and if I am not mistaken there are no bank conflicts. So I truly don't know why there has been no significant improvement. Could anyone help?

Oops, yes, of course.

I'm not sure about this, but the CUDA compiler does very aggressive optimization; it could be that it already performs this transformation on its own. You might want to check the PTX output.
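
Something along these lines should dump the PTX for inspection (assuming the kernel lives in a file called kernel.cu):

	nvcc -ptx -arch=sm_12 kernel.cu -o kernel.ptx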

Anyway, the code is most likely bandwidth bound, so no compute optimization will make it faster.

Hi Tera,

It seems I had something wrong: I was using double precision on hardware that doesn't support it. Your code actually is faster.

Average times for the gamma1 matrix when the kernel/golden versions are executed 1,000,000 times:

time on gpu = 0.160034ms (YOUR VERSION)

time on gpu = 0.190041ms (My Version)

time on cpu = 0.091949ms

Test PASSED - Results are Equal

But as you can see, both are slower than the CPU: a quad-core Q9300 @ 2.5 GHz, 3 MB L2 cache, with 4 GB of memory! :(

Ah, I should have asked when I stumbled over the fact that the kernels supposedly executed in about 4 microseconds; I silently assumed you meant milliseconds instead. So are you using float now? And is your device compute capability 1.2? (Otherwise my code would be much slower.)

Are you doing something else with the data than just reordering it? Otherwise it seems quite clear it’s impossible to make up for the transfer time to and from the device.

Yes, using float! My device is a G210, so it is compute capability 1.2!

The whole code just reads a vector of 49152 elements, reorders it, adjusts the signs, and copies it back, nothing more or less!

I have been working on this code for a long time, yet with no remarkable speedups!

Any help is appreciated!

In that case I'm sorry to say it's impossible to beat your CPU, particularly as the CPU version probably operates entirely within its L2/L3 cache (depending on whether you use multiple cores), whose bandwidth is at least an order of magnitude larger than that of the PCIe link the data has to cross to reach the GPU.
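
As a rough back-of-the-envelope check (assuming an effective PCIe bandwidth of about 3 GB/s, which is already optimistic for small transfers):

	49152 floats * 4 bytes = 192 KB per direction
	2 * 192 KB / ~3 GB/s  = roughly 0.13 ms round trip

That alone is already larger than your 0.09 ms CPU time, before the kernel even starts.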

Thanks for the help Tera!! You have been very helpful indeed! :)