Optimizing vector element swaps using CUDA (the swaps involved are not direct!)

Hi all,

Since I am new to CUDA, I need your kind help. I have a long vector, and for each group of 24 elements I need to do the following:
for the first 12 elements, the even-numbered elements are multiplied by -1;
for the second 12 elements, the odd-numbered elements are multiplied by -1;
then the following swap takes place (image attached; here is a link to it):

Graph swap image
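
To make the pattern concrete, here is a plain C sketch of what should happen to each group of 24 (host-side reference only; the function name and in-place layout are just mine for illustration):

void reference(double *v, int n)   /* n is a multiple of 24 */
{
    for (int base = 0; base < n; base += 24) {
        /* first half: negate even-indexed elements */
        for (int i = 0; i < 12; ++i)
            if ((i & 1) == 0)
                v[base + i] = -v[base + i];
        /* second half: negate odd-indexed elements */
        for (int i = 12; i < 24; ++i)
            if ((i & 1) == 1)
                v[base + i] = -v[base + i];
        /* swap the six-element blocks 0<->3 and 1<->2,
           exchanging adjacent pairs (i <-> i^1) while doing so */
        for (int i = 0; i < 12; ++i) {
            int j = (i ^ 1) + (i < 6 ? 18 : 6);
            double tmp = v[base + i];
            v[base + i] = v[base + j];
            v[base + j] = tmp;
        }
    }
}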

I have written this piece of code and wonder if you could help me optimize it further to address divergence or bank conflicts:
//subVector is a multiple of 24; Mds and Nds are in shared memory

__shared__ double Mds[subVector];
__shared__ double Nds[subVector];

int tx = threadIdx.x;
int tx_mod = tx ^ 0x0001;  // partner index for the pairwise exchange
int basex = __umul24(blockDim.x, blockIdx.x);

Mds[tx] = M.elements[basex + tx];
__syncthreads();

// flip the signs
if (tx < (tx/24)*24 + 12)
{
    // first half of the 24-element group: negate even-indexed elements
    if ((tx & 0x0001) == 0)
        Mds[tx] = -Mds[tx];
}
else if (tx < (tx/24)*24 + 24)
{
    // second half of the 24-element group: negate odd-indexed elements
    if ((tx & 0x0001) == 1)
        Mds[tx] = -Mds[tx];
}

__syncthreads();

if (tx < (tx/24)*24 + 6)
{
    // first 6 elements: swap with the last six in the 24-element group (see graph)
    Nds[tx] = Mds[tx_mod + 18];
    Mds[tx_mod + 18] = Mds[tx];
    Mds[tx] = Nds[tx];
}
else if (tx < (tx/24)*24 + 12)
{
    // second 6 elements: swap with the next adjacent group of six (see graph)
    Nds[tx] = Mds[tx_mod + 6];
    Mds[tx_mod + 6] = Mds[tx];
    Mds[tx] = Nds[tx];
}

__syncthreads();

Thanks in advance …

Sure. You can fold the sign flip and the permutation into the load itself, so each element is read and written only once and the second shared array and the extra synchronizations disappear:

//subVector is a multiple of 24; Mds is in shared memory

	__shared__ double Mds[subVector];

	int tx = threadIdx.x;
	int basex = __umul24(blockDim.x, blockIdx.x);

	// source index: swap the 6-element sub-blocks 0<->3 and 1<->2 within each
	// group of 24, exchanging adjacent pairs (i <-> i^1) within each sub-block
	int permuted_idx = ((tx/6) ^ 3) * 6 + ((tx%6) ^ 1);

	int negate = (tx%24 < 12) ^ (tx & 1);

	Mds[tx] = M.elements[basex + permuted_idx];

	if (negate)
		Mds[tx] = -Mds[tx];

	__syncthreads();

I’d personally be interested to know if this one is slower or faster:

	__shared__ double Mds[subVector];

	int tx = threadIdx.x;
	int basex = __umul24(blockDim.x, blockIdx.x);

	int permuted_idx = ((tx/6) ^ 3) * 6 + ((tx%6) ^ 1);

	double sign = (tx%24 < 12) ^ (tx & 1) ? 1.0 : -1.0;

	Mds[tx] = sign * M.elements[basex + permuted_idx];

	__syncthreads();

I guess it’s faster because it avoids some shared memory accesses. If Mds[tx] were a register, I’d be less sure. It would all depend on whether predicated double operations are still scheduled if the predicate is false.

Hi Tera,

Thanks for the help, I like the simplicity of the code!

Before the timings: in the code it should be double sign = (tx%24 < 12) ^ (tx & 1) ? -1.0 : 1.0; to give correct results.

The code has been tested on a G210 device, using the cutil library timers (from the NVIDIA SDK) to measure time.

As for the timing: I used an input vector of 49152 elements and ran the code 1,000,000 times.

The average execution time of your suggested version is 0.004721 ms, and for the original code posted here it is 0.004757 ms.

I find it quite strange: your version halves the shared memory usage and likewise the number of shared memory accesses, yet no significant difference is observed. The G210 has a relaxed memory coalescing model, so the non-contiguous global memory access should have little effect on the results, and if I am not mistaken there are no bank conflicts. So I truly don't know why there has been no significant improvement. Could anyone help?

Oops, yes, of course.

I'm not sure about this, but the CUDA compiler does very aggressive optimization; it could be that it already performs this transformation on its own. You might want to check the PTX output.
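
Something along these lines should dump the PTX for inspection (assuming the kernel lives in a file called kernel.cu):

	nvcc -ptx -arch=sm_12 kernel.cu -o kernel.ptx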

Anyway, the code is most likely bandwidth bound, so no compute optimization will make it faster.

Hi Tera,

It seems I had something wrong: I was using double precision on hardware that doesn't support it. Your code actually is faster.

Average times for the gamma1 matrix when the kernel/golden versions are executed 1,000,000 times:

time on gpu = 0.160034ms (YOUR VERSION)

time on gpu = 0.190041ms (My Version)

time on cpu = 0.091949ms

Test PASSED - Results are Equal

But as you can see, both are slower than the CPU: a quad-core Q9300 @ 2.5 GHz, 3 MB L2 cache, with 4 GB of memory! :(

Ah, I should have asked when I stumbled over the fact that the kernels supposedly executed in about 4 microseconds; I silently assumed you meant milliseconds instead. So are you using float now? And is your device compute capability 1.2? (Otherwise my code would be much slower.)

Are you doing something else with the data than just reordering it? Otherwise it seems quite clear it’s impossible to make up for the transfer time to and from the device.

Yes, using float! My device is a G210, so it is compute capability 1.2!

The whole code just reads a vector of 49152 elements, reorders it, adjusts the signs, and copies it back, nothing more or less!

I have been working on this code for a long time, yet with no remarkable speedups!

Any help is appreciated!

In that case I'm sorry to say it's impossible to beat your CPU, particularly as the CPU version probably operates entirely within its L2/L3 cache (depending on whether you use multiple cores), whose bandwidth is at least an order of magnitude larger than that of the PCIe link the data has to cross to reach the GPU.
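
As a rough back-of-the-envelope check (assuming an effective PCIe bandwidth of about 3 GB/s, which is already optimistic for small transfers):

	49152 floats * 4 bytes = 192 KB per direction
	2 * 192 KB / ~3 GB/s  = roughly 0.13 ms round trip

That alone is already larger than your 0.09 ms CPU time, before the kernel even starts.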

Thanks for the help Tera!! You have been very helpful indeed! :)