Low performance on Ubuntu 18.04

Hello everyone!
I have some problems with N-body simulation.
I implemented the algorithm from GPU Gems 3, Chapter 31, "Fast N-Body Simulation with CUDA",
on both the GPU and the CPU, but the CPU solves the problem in less time.
My program takes 3 arguments:
catalogue.csv - the initial state of the system
framesAmount - the number of frames to compute (I’m using 1000 for testing)
writeRate - how often frames are written (I’m using 365)
So my program computes writeRate frames without transferring data from the GPU, then copies them back and writes them to the file.
With 1024 bodies the CPU takes less than 2 seconds; the GPU takes around 7 seconds.
I’m using CUDA 10 with the 410 driver. My programs are written in pure C.
I installed CUDA and driver using this instruction: How do I install NVIDIA and CUDA drivers into Ubuntu? - Ask Ubuntu
How can I solve this problem?

  • make sure you are using float, not double, for positions, velocities, and forces
  • make sure you are building the project without any -G debug switches (show your compile command line)
  • your choice of body count (1024) is small, and the article has specific suggestions for small body counts
  • use a faster GPU (identify which GPU you are using)
  • use a profiler to identify performance bottlenecks
  • study the CUDA nbody sample code [url]https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-n-body-simulation[/url]
  • it shouldn’t really be necessary, but you may want to learn about and experiment with the --use_fast_math compile switch
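For reference, a release-mode build rule might look like the sketch below. Assumptions: the source file is called nbody.cu, and sm_61 matches the poster's GTX 1060; adjust both to your setup.

```makefile
# Release build: no -G (device debug) and no -g; -O3 for host code.
# sm_61 targets Pascal (GTX 1060); --use_fast_math trades accuracy
# for speed, so verify results before relying on it.
nbody: nbody.cu
	nvcc -O3 -arch=sm_61 --use_fast_math -o nbody nbody.cu
```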

You’re welcome to post your code so that someone else can build and run it; there may be better suggestions if you do. If you choose to, my suggestion would be to provide complete code: something that someone else could copy, paste, compile, and run without having to add or change anything.

I have worked on a simplistic n-body code (it is one of the codes we use in a DLI CUDA training course), and the CPU/GPU comparison usually becomes interesting when the body count exceeds 2^15. 1024 is much too small to be interesting on a large/modern/fast GPU.

I’m using a GeForce GTX 1060 3GB. I will try to increase the body count a bit later. All the code can be found here: https://github.com/DarkFuria/nBody
Right now I’m trying to install the 415 driver and CUDA 9.1, since the recommended method breaks some other programs.

I assume you are comparing to this:

[url]https://github.com/DarkFuria/nBody-CPU[/url]

I built both projects on a system with Fedora 27, Xeon X5560 2.8GHz, Quadro K2000 GPU (much slower than your GPU) and CUDA 9.2

Your catalogue1024.csv files don’t match between the two projects. Numerically I don’t think that should matter, but the one in the GPU project has only 1023 lines in it, so that seems incorrect.

In the GPU project, your settings.h sets N_BODYS to 4096, whereas the CPU project sets it to 128. I modified the GPU project to use 1024 and copied the catalogue1024.csv from the CPU project (which has 1024 lines in it) to the GPU project.

In your Makefile I modified sm_61 to sm_30 to match my GPU.

After that I compiled (make) and ran each project with the following command line:

time ./nbody catalogue1024.csv 1000 365

The GPU code (with 1024 bodies) took about 23 minutes to complete, whereas the CPU code (with 128 bodies) is still running after more than 35 minutes.

So if you got these codes to run in 2 seconds (CPU) and 7 seconds (GPU), I’m very impressed with your setup. I don’t see any indication in my setup that the CPU is faster than the GPU, however.

Yes, excuse me, my fault. I should have included the same test catalogue.
I ran the tests again (catalogue1024.csv 1000 365):
GPU: 5m32s
CPU (with -Ofast): 2.5s
CPU (without -Ofast): still running, more than 40 minutes
So I think gcc applies some strange optimizations with the -Ofast flag…

On a much newer system with a faster GPU, I was able to get the GPU test (1024 bodies) down to 10 minutes.

I killed the CPU test on the old system after 2 hours of execution time.

I killed the CPU test on the new system after 1 hour of execution time.

I see no indication that the CPU is faster than the GPU.

Your makefile for the CPU code already specifies -Ofast

I don’t see anything like a 2.5-second execution time using that makefile, with gcc 5.4.0 or gcc 7.3.1.

So I am unable to reproduce your observation.

I see no indication that the CPU build with the -Ofast GCC flag in the CPU makefile is faster than the build without it.

That makes me wonder whether your host code is hitting one of those (in-)famous optimizations that take advantage of undefined behavior, such as signed integer overflow. When it detects undefined behavior, the compiler is justified in eliminating all affected code, which may cause a massive reduction in run time. Which version of gcc is being used here?

Have you checked whether compiling with -Ofast results in output anywhere close to the correct / expected results? A program allowed to deliver incorrect results can be made arbitrarily fast.

You may want to consider cranking up warning sensitivity to identify as much questionable code in the host version as possible, and address all reported items. At minimum, use -Wall (which contrary to its name does not turn on all warnings). In addition or alternatively, consider gradually increasing the optimization level, e.g. -O1, -O2, -O3, -O3 -ffast-math, -Ofast. At what stage does run time drop dramatically?

I’m using gcc 7.3.0 for my host code.
Run time drops dramatically at the -O3 optimization level (0.8s per frame with the 1024-body cluster).
Adding -Wall didn’t produce any warnings.

Generally, I would not expect more than 20% speed-up between -O2 and -O3, but as far as I recall -O3 enables SIMD auto-vectorization and that could provide a much bigger kick in apps dominated by tight computational loops.
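The kind of loop that benefits from auto-vectorization is exactly an n-body inner loop: independent iterations over contiguous float arrays. A minimal example of the pattern (not code from the posted project):

```c
/* Independent per-element work over contiguous arrays: the pattern
 * gcc's auto-vectorizer targets at -O3 (or -O2 -ftree-vectorize).
 * The restrict qualifiers promise x and y don't alias, which the
 * vectorizer needs. Compile with -fopt-info-vec to see whether
 * the loop was actually vectorized. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With AVX this processes 8 floats per instruction, so a 4-8x gain from vectorization alone is plausible in loops like this.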

If I interpret your description above correctly, the working set of your data is very small and should fit entirely in the CPU’s L1/L2 caches, while at the same time limiting the parallelism GPUs can exploit (>10K threads are desirable, but you would be using about a tenth of that).

How big is the speedup factor between -O2 and -O3? Have you checked whether the app returns correct results at -O3?

The speedup factor between -O2 and -O3 is around 7x. No, I haven’t checked the results at -O3. -Ofast is about 25x faster than -O3.

These speedup factors seem improbably high to me.