Hi Njuffa,
It is a simple O(N^2) method I have been using for years in my CUDA courses.
Here is the inner loop: each particle is assigned to a thread, which accumulates the force contributions of the other particles in the same block while their particle data sit in shared memory.
Computations are all in single precision.
typedef float real_t;
#define LAMBDA (real_t)6.6730e-11 /* gravitational constant */
…
real_t ac, r, r2;
…
{
    /* d = position difference to particle c, r2 = |d|^2,
       massm[] = masses of this block's particles in shared memory
       (all set up in the code elided above) */
    r = sqrtf(r2);
#ifdef USEFASTMATH
    ac = __fdividef(LAMBDA * massm[c], r2);
    at.x += __fdividef(ac * d.x, r);
    at.y += __fdividef(ac * d.y, r);
#else
    ac = LAMBDA * massm[c] / r2;
    at.x += ac * d.x / r;
    at.y += ac * d.y / r;
#endif
}
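For completeness, the surrounding code follows the usual shared-memory tiling pattern. Below is a stripped-down sketch of what such a kernel looks like, not the literal course code: the names pos/posm/mass/acc/nbodies, the float2 coordinates, BLOCKSIZE and the EPS2 softening term are placeholders, and it assumes nbodies is a multiple of BLOCKSIZE.

/* Simplified sketch of the tiled O(N^2) kernel -- placeholder names,
   not the actual course code. */
#define BLOCKSIZE 256
#define EPS2 (real_t)1.0e-6              /* softening term, assumed */

__global__ void nbody_tiled(const float2 *pos, const real_t *mass,
                            float2 *acc, int nbodies)
{
    __shared__ float2 posm[BLOCKSIZE];   /* this tile's positions */
    __shared__ real_t massm[BLOCKSIZE];  /* this tile's masses    */

    int    i  = blockIdx.x * blockDim.x + threadIdx.x;
    float2 p  = pos[i];                  /* this thread's particle */
    float2 at = make_float2(0.0f, 0.0f); /* accumulated acceleration */
    real_t ac, r, r2;

    for (int tile = 0; tile < nbodies / BLOCKSIZE; tile++) {
        int j = tile * BLOCKSIZE + threadIdx.x;
        posm[threadIdx.x]  = pos[j];     /* each thread stages one particle */
        massm[threadIdx.x] = mass[j];
        __syncthreads();

        for (int c = 0; c < BLOCKSIZE; c++) {   /* inner loop shown above */
            float2 d;
            d.x = posm[c].x - p.x;
            d.y = posm[c].y - p.y;
            r2  = d.x * d.x + d.y * d.y + EPS2;
            r   = sqrtf(r2);
#ifdef USEFASTMATH
            ac    = __fdividef(LAMBDA * massm[c], r2);
            at.x += __fdividef(ac * d.x, r);
            at.y += __fdividef(ac * d.y, r);
#else
            ac    = LAMBDA * massm[c] / r2;
            at.x += ac * d.x / r;
            at.y += ac * d.y / r;
#endif
        }
        __syncthreads();
    }
    acc[i] = at;
}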
On all CUDA devices so far (I started about 8 years ago, so you are 2+ years ahead :-) ), using --use_fast_math gave a factor 2-3 improvement over the standard code (i.e., without USEFASTMATH).
However, we recently replaced the cards in the lab room with K620s, and now, instead of an improvement, --use_fast_math suddenly makes the code slightly slower! The same happens on our Titan X system (all arch 5.0).
I changed the code to the part above (using the USEFASTMATH macro), which gave a slight improvement over the standard code, but nowhere near the factor 2-3 we had before.
I also tried explicitly adding "-prec-div=false -ftz=false -prec-sqrt=false --fmad=true" in the Makefile, but to no avail. The arch 5.0 cards aren't any faster with fast math than without, while arch 2.0 and 3.0 are still 2-3 times faster.
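Side note: the two divisions and the sqrtf could also be avoided entirely with rsqrtf. This is only a sketch of an alternative inner-loop body for comparison, not what the course code currently does:

/* Alternative inner body using the (approximate) reciprocal square root;
   mathematically the same update, a = G*m*d/r^3. Sketch only. */
real_t rinv  = rsqrtf(r2);            /* ~ 1/r   */
real_t rinv3 = rinv * rinv * rinv;    /* ~ 1/r^3 */
ac    = LAMBDA * massm[c] * rinv3;
at.x += ac * d.x;
at.y += ac * d.y;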
Compiler output:
nvcc -DDEBUG=1 -Xcompiler "-Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function" -D_DEBUG -G -I. -I/opt64/cuda_6.5.14/include -I/opt64/cuda_6.5.14/samples/common/inc -DUNIX -D_GLIBCXX_GCC_GTHR_POSIX_H -g -Xptxas -v -O -prec-div=false -ftz=false -prec-sqrt=false --fmad=true -DUSEFASTMATH=1 -arch=sm_20 -o parbody_card-psm1.o -dc parbody_card-psm1.cu
ptxas info : 128 bytes gmem, 48 bytes cmem[14]
ptxas info : Function properties for _Z8ldebug_diPKci
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z13parbodyKernelP8_paramsdP3CrdS2_PfPc' for 'sm_20'
ptxas info : Function properties for _Z13parbodyKernelP8_paramsdP3CrdS2_PfPc
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 16016 bytes smem, 72 bytes cmem[0]
nvcc -o parbody-psm1-fastmath parbody_card-psm1.o parbody_host-pl.o -L/opt64/cuda_6.5.14/lib64 -L/opt64/cuda_6.5.14/lib -L/opt64/cuda_6.5.14/samples/lib -L/opt64/cuda_6.5.14/samples/common/lib -lcudart -lGL -lGLU -lglut -L. -ljgraphlib_x86_64 -arch=sm_20
K600 result on a small problem of 1024 particles:
~/Parallel/Cuda/Nbody-GPU 244 % ./parbody-psm1-fastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 2.995042 secs for 400 iterations.
~/Parallel/Cuda/Nbody-GPU 245 % ./parbody-psm1-nofastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 9.913860 secs for 400 iterations.
K620 result on a small problem of 1024 particles:
~/Parallel/Cuda/Nbody-GPU 247 % ./parbody-psm1-nofastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 3.370911 secs for 400 iterations.
prz03:~/Parallel/Cuda/Nbody-GPU 248 % ./parbody-psm1-fastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 2.746861 secs for 400 iterations.
============================================================
Remarkably, if I do NOT use -DUSEFASTMATH=1 it is even slower, whereas one would expect it to make no difference at all, since the --use_fast_math flag should make exactly the same replacements as the -DUSEFASTMATH=1 macro does.
~/Parallel/Cuda/Nbody-GPU 251 % ./parbody-psm1-fastmath TF/1024grid.par
Parallel particle simulation using straightforward PP: distribute all particles over the processes.
Using 4 blocks for 1024 bodies …
Run time parallel on GPU : 3.051782 secs for 400 iterations.
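For what it's worth, a tiny standalone probe along these lines could time just this divide/sqrt pattern in isolation, away from the rest of the simulation. This is an untested sketch; the file name, problem size and iteration count are arbitrary:

/* probe.cu -- untested sketch: times only the divide/sqrt pattern */
#include <cstdio>
#include <cuda_runtime.h>

typedef float real_t;
#define LAMBDA (real_t)6.6730e-11 /* gravitational constant */

__global__ void probe(const real_t *in, real_t *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    real_t r2  = in[i] + 1.0f;   /* keep r2 away from zero */
    real_t acc = 0.0f;
    for (int k = 0; k < iters; k++) {
        real_t r = sqrtf(r2);
#ifdef USEFASTMATH
        real_t ac = __fdividef(LAMBDA, r2);
        acc += __fdividef(ac, r);
#else
        real_t ac = LAMBDA / r2;
        acc += ac / r;
#endif
        r2 += acc + 1.0f;        /* data dependence so nothing gets hoisted */
    }
    out[i] = acc;
}

int main(void)
{
    const int n = 1 << 20, iters = 1000;
    real_t *in, *out;
    cudaMalloc((void **)&in,  n * sizeof(real_t));
    cudaMalloc((void **)&out, n * sizeof(real_t));
    cudaMemset(in, 0, n * sizeof(real_t));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    probe<<<n / 256, 256>>>(in, out, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("probe kernel: %.3f ms\n", ms);
    return 0;
}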
BTW: I also recompiled for arch sm_35 and sm_50, but that doesn't change the fastmath/nofastmath ratio (although in general the code seems slightly faster).
Hope this gives you a little more info.
Best,
Kees