As you can see, the GTX 480 performs very near peak bandwidth (88.5% of peak) while the GT 650M doesn't reach quite as high (76.39% of peak). I suspect the similarities with GF104 with respect to ILP make this style of architecture a bit harder to reach peak on; for example, the SDK bandwidth test only reaches about 32 GB/s (50%) on my GT 650M.
Here is a comparison on my laptop vs. the SDK reduction sum example:
Bandwidth:
[attachment=25852:bandwidth_GT650m.png]
Speedup:
[attachment=25853:speedup_VS_NV_GT650m.png]
I hope all my numbers add up and that it runs well on other machines as well :)
The source code produces the test printout data shown above.
What kind of performance do you get if you simply use an atomic add? According to the Kepler whitepaper, an atomic add to a shared address should be about 11 times faster than on Fermi. I'm however not sure if this is only true for Tesla, or also for GeForce.
I had to do a little fiddling to get this to compile on Linux (remove the pause() function, because there is no fgetchar(), and it looks like you only half-finished get_clock() for the non-Win32 folks :) ).
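For what it's worth, a hypothetical sketch of how the non-Win32 get_clock() could be filled in, using POSIX clock_gettime() (this is my assumption, not code from the original post):

```cuda
#include <cstdio>
#include <ctime>

// Hypothetical portable get_clock() for the non-Win32 path: returns
// elapsed seconds from a monotonic clock, suitable for interval timing.
static double get_clock(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  // monotonic, unaffected by clock adjustments
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

The Win32 branch can stay as-is; this only covers the `#else` side.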
Unfortunately, it looks like the shuffle instruction is no help for this problem.
Edit: On further reflection, that doesn’t seem to be surprising since the reduction seems to be completely saturating the memory bandwidth once the array gets big enough. The time spent doing the reduction in the block is mostly negligible.
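For reference (this is a sketch, not the benchmarked code), the warp-level sum I tried with Kepler's __shfl_down() intrinsic looks roughly like:

```cuda
// Warp-level sum reduction using Kepler's __shfl_down(). Each step
// adds the partial sum held 'offset' lanes up, halving the number of
// live partials; after 5 steps lane 0 holds the warp total.
__inline__ __device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;  // result is valid in lane 0
}
```

Since the kernel is memory-bound, shaving cycles off this step makes essentially no difference.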
Sorry about that! I didn't do any testing on a Linux machine before posting, just made some optimistic assumptions :P
I will post an update addressing this when I have time…
Yes, the final per-warp reduction is almost negligible, since each warp first accumulates roughly 2048 elements. But I do remember seeing a marginal improvement (~0.3%) on my laptop.
Interesting results on the GTX 680! I'm a bit disappointed that it doesn't reach quite as high as Fermi. Do you think there's any chance that GK104 has partition-camping issues? I know this was supposed to have been addressed for Fermi, but I believe it was still visible on GF100.
Anyway, I'm reaching higher bandwidth utilization doing this reduction sum than I am running the bandwidth test, so it's not too bad :)
Out of curiosity, I took a look at the speed of doing reductions with atomic functions. While not as fast as the above reduction implementation, it is not as bad as you might think. I compared a very simple implementation that uses both shared and global atomic functions:
__global__ void shared_reduce(int size, float *in, float *out)
{
    __shared__ float sum;
    int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    if (threadIdx.x == 0)
        sum = 0.0f;
    __syncthreads();

    // Grid-stride loop: each thread accumulates a private partial sum.
    float local_sum = 0.0f;
    for (int i = start_idx; i < size; i += stride) {
        local_sum += in[i];
    }

    // One shared atomic per thread, then one global atomic per block.
    atomicAdd(&sum, local_sum);
    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(out, sum);
}
to a very short version that only uses global atomic functions:
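(The global-only kernel isn't reproduced in this excerpt; presumably it is something like the following sketch, i.e. the same grid-stride loop with each thread's partial sum added straight to the global result:)

```cuda
// Sketch of the "global atomics only" variant: no shared-memory
// staging, just one global atomicAdd per thread.
__global__ void global_reduce(int size, float *in, float *out)
{
    int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float local_sum = 0.0f;
    for (int i = start_idx; i < size; i += stride)
        local_sum += in[i];

    atomicAdd(out, local_sum);  // all threads contend on one address
}
```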
This was after some not-entirely-rigorous testing of different block and grid sizes; I settled on 128 threads per block and 128 blocks as working pretty well for both cards.
The interesting thing to note is that with the GTX 680, shared-memory atomics are less important for reductions on medium-sized arrays. Or to put it another way: if you are in a hurry, you can write a really simple-minded reduction and still be in the "good enough" zone. :)
I'm not too surprised that the atomics perform so well, given that most of the computation time is actually spent in the for loop (which is probably also true in seibert's tests). It does seem to confirm that Kepler atomics have a serious advantage over Fermi's. It's nice to see that you can simplify the code a lot and still get good performance! :)
@wanderine
I'm curious why there's such interest in using atomics in this case when they're not needed?
I tried the thrust library and it was much slower than my own simple code.
I want to calculate 30 different values in each position of a 3D grid (like 128 x 128 x 128) and then sum the 30 values separately (i.e. 30 sums). If I use atomics, the code would be a single kernel, instead of the three kernels that I use now (calculate values and sum over x, sum over y, sum over z).
Sounds like a good place to try atomics, then. The amount of data being moved around is small enough that the overhead of launching new kernels probably swamps any benefit of doing a proper reduction without atomics.
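A single-kernel version of that could be sketched roughly as follows. This is just to illustrate the idea: compute_value() is a hypothetical placeholder for the real per-grid-point calculation, and the launch shape is an assumption, not code from this thread.

```cuda
#define NUM_VALUES 30

// Hypothetical placeholder for the real per-grid-point calculation
// of value number v at linear grid index idx.
__device__ float compute_value(const float *grid, int idx, int v);

// One kernel: compute the 30 values per grid point, reduce them per
// block in shared memory, then add the 30 block totals to the 30
// global sums with atomicAdd.
__global__ void sums_30(const float *grid, float *sums, int n)
{
    __shared__ float block_sum[NUM_VALUES];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Assumes blockDim.x >= NUM_VALUES (e.g. 128 threads per block).
    if (threadIdx.x < NUM_VALUES)
        block_sum[threadIdx.x] = 0.0f;
    __syncthreads();

    if (idx < n) {
        for (int v = 0; v < NUM_VALUES; v++)
            atomicAdd(&block_sum[v], compute_value(grid, idx, v));
    }
    __syncthreads();

    // One global atomic per value per block.
    if (threadIdx.x < NUM_VALUES)
        atomicAdd(&sums[threadIdx.x], block_sum[threadIdx.x]);
}
```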
Rather than a sum reduction, how about a scan which returns the max/min of an array and the corresponding index of that max/min. Something like thrust::max_element(). How would one implement that type of scan using __shfl() with or without atomics?
Also why is your device-to-device bandwidth so high? Is that also dependent on the PCI-e version?
I implemented such a min/max scan a long time ago for Fermi and earlier architectures (hence no shuffle) in a post at [1]. It would need to be optimized and verified again for the Kepler architecture. And I would do it without using atomics (although that is an option).
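On Kepler, the warp-level building block could use shuffles instead. An untested sketch, using the Kepler-era __shfl_down() intrinsic: each lane carries a (value, index) pair, and at each step it keeps the larger of its own pair and the pair held 'offset' lanes up, so lane 0 ends with the warp's max and its index.

```cuda
// Warp-level max-with-index reduction via Kepler shuffles, no atomics.
// After the loop, lane 0 holds the warp's maximum value and the array
// index it came from.
__inline__ __device__ void warp_arg_max(float &val, int &idx)
{
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other_val = __shfl_down(val, offset);
        int   other_idx = __shfl_down(idx, offset);
        if (other_val > val) {
            val = other_val;
            idx = other_idx;
        }
    }
}
```

The per-block and per-grid stages would then combine the per-warp results the same way the sum reduction does.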
Also why is your device-to-device bandwidth so high? Is that also dependent on the PCI-e version?
It is independent of the PCI-e version, as I'm only measuring D2D transfers, i.e. device RAM to device RAM. The GTX Titan can theoretically do 288 GB/s. The code is optimized for high bandwidth utilization.