Since I am new to CUDA, I need your kind help. I have a long vector, and for each group of 24 elements I need to do the following:
for the first 12 elements, the even-numbered elements are multiplied by -1,
for the second 12 elements, the odd-numbered elements are multiplied by -1,
then the following swap takes place (image attached).
I have written this piece of code, and wonder if you could help me further optimize it to avoid divergence and bank conflicts.
// subvector is a multiple of 24; Mds and Nds are shared memory
// (tx is the thread index; tx_mod is defined elsewhere in the code)

// flip the signs
if (tx < (tx / 24) * 24 + 12)
{
    // position in group < 12 and even
    if ((tx & 0x0001) == 0)
        Mds[tx] = -Mds[tx];
}
else if (tx < (tx / 24) * 24 + 24)
{
    // position in group >= 12 and < 24 and odd
    if ((tx & 0x0001) == 1)
        Mds[tx] = -Mds[tx];
}
__syncthreads();

if (tx < (tx / 24) * 24 + 6)
{
    // first 6 elements: swap with the last six in the 24-element group (see graph)
    Nds[tx] = Mds[tx_mod + 18];
    Mds[tx_mod + 18] = Mds[tx];
    Mds[tx] = Nds[tx];
}
else if (tx < (tx / 24) * 24 + 12)
{
    // second 6 elements: swap with the next adjacent group (see graph)
    Nds[tx] = Mds[tx_mod + 6];
    Mds[tx_mod + 6] = Mds[tx];
    Mds[tx] = Nds[tx];
}
I guess it’s faster because it avoids some shared memory accesses. If Mds[tx] were a register, I’d be less sure. It would all depend on whether predicated double operations are still scheduled if the predicate is false.
Thanks for the help, I like the simplicity of the code.
Before the timings: in the code it should be double sign = (tx % 24 < 12) ^ (tx & 1) ? -1.0 : 1.0; to give correct results.
The code has been tested on a G210 device, using the cutil library timers (from the NVIDIA SDK) to measure time.
As for the timing: I used an input vector of 49152 elements and ran the code 1,000,000 times. The average execution time of your suggested version is 0.004721 ms, and of the original code here 0.004757 ms.
I find it quite strange: your version halved the shared memory space and the number of shared memory accesses, yet no significant difference is observed. The G210 has a relaxed memory coalescing model, so the non-contiguous global memory access should have little effect on the results. If I am not mistaken, there are no bank conflicts either, so I truly don't know why there has been no significant improvement. Could anyone help?
I'm not sure about this, but the CUDA compiler does very aggressive optimization. It could be that it already performs this optimization on its own. You might want to check the PTX output.
Anyway, the code is most likely bandwidth bound, so no compute optimization will make it faster.
Ah, I should have asked you when I stumbled over the fact that the kernels executed in about 4 microseconds. I silently assumed you probably meant milliseconds instead. So are you using float now? And is your device compute capability 1.2 (otherwise my code would be much slower)?
Are you doing something else with the data than just reordering it? Otherwise it seems quite clear it’s impossible to make up for the transfer time to and from the device.
In that case I'm sorry to say it's impossible to beat your CPU. Particularly as the CPU version will probably operate entirely within its L2/L3 cache (depending on whether you use multiple cores), whose bandwidth is at least an order of magnitude larger than that of the PCIe link the data has to cross to get to the GPU.