Low performance on Ubuntu 18.04

Hello everyone!
I have some problems with N-body simulation.
I implemented the algorithm from GPU Gems 3, Chapter 31, "Fast N-Body Simulation with CUDA",
on both the GPU and the CPU, but the CPU solves the problem in less time.
My program takes 3 arguments:
catalogue.csv - the initial state of the system
framesAmount - the number of frames to compute (I’m using 1000 for testing)
writeRate - how often frames are written (I’m using 365)
So my program computes writeRate frames without transferring data from the GPU, then copies them back and writes them to the file.
With 1024 bodies the CPU takes less than 2 seconds; the GPU takes around 7 seconds.
I’m using CUDA 10 with the 410 driver. My programs are written in pure C.
I installed CUDA and driver using this instruction: How do I install NVIDIA and CUDA drivers into Ubuntu? - Ask Ubuntu
How can I solve this problem?

  • make sure you are using float, not double, for positions, velocities, and forces
  • make sure you are building the project without any -G debug switches (show your compile command line)
  • your choice of body count (1024) is small, and the article has specific suggestions for small body counts
  • use a faster GPU (identify which GPU you are using)
  • use a profiler to identify performance bottlenecks
  • study the CUDA nbody sample code [url]https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-n-body-simulation[/url]
  • it shouldn’t really be necessary, but you may want to learn about and experiment with the --use_fast_math compile switch
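For reference, a release-mode build rule might look like the sketch below. Assumptions: the source file is called nbody.cu, and sm_61 matches the poster's GTX 1060; adjust both to your setup.

```makefile
# Release build: no -G (device debug) and no -g; -O3 for host code.
# sm_61 targets Pascal (GTX 1060); --use_fast_math trades accuracy
# for speed, so verify results before relying on it.
nbody: nbody.cu
	nvcc -O3 -arch=sm_61 --use_fast_math -o nbody nbody.cu
```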

You’re welcome to post your code so that someone else can build and run it; there may be better suggestions if you do. If you choose to, my suggestion would be to provide complete code: something that someone else could copy, paste, compile, and run without having to add or change anything.

I have worked on a simplistic n-body code (it is one of the codes we use in a DLI CUDA training course), and the CPU/GPU comparison usually becomes interesting when the body count exceeds 2^15. 1024 is much too small to be interesting on a large/modern/fast GPU.

I’m using a GeForce GTX 1060 3GB. I will try to increase the body count a bit later. All the code can be found here: https://github.com/DarkFuria/nBody
Right now I’m trying to install the 415 driver and CUDA 9.1, since the recommended method breaks some other programs.

I assume you are comparing to this:

[url]https://github.com/DarkFuria/nBody-CPU[/url]

I built both projects on a system with Fedora 27, Xeon X5560 2.8GHz, Quadro K2000 GPU (much slower than your GPU) and CUDA 9.2

Your catalogue1024.csv files don’t match between the two projects. Numerically I don’t think that should matter, but the one in the GPU project has only 1023 lines in it, so that seems incorrect.

In the GPU project, your settings.h sets N_BODYS to 4096, whereas the CPU project sets it to 128. I modified the GPU project to use 1024 and copied the catalogue1024.csv from the CPU project (which has 1024 lines in it) to the GPU project.

In your Makefile I modified sm_61 to sm_30 to match my GPU.

After that I compiled (make) and ran each project with the following command line:

time ./nbody catalogue1024.csv 1000 365

The GPU code (with 1024 bodies) took about 23 minutes to complete, whereas the CPU code (with 128 bodies) is still running after more than 35 minutes.

So if you got these codes to run in 2 seconds (CPU) and 7 seconds (GPU), I’m very impressed with your setup. I don’t see any indication in my setup that the CPU is faster than the GPU, however.

Yes, excuse me, my fault. I should have included the same test catalogue.
I ran the tests again (catalogue1024.csv 1000 365):
GPU: 5m32s
CPU (with -Ofast): 2.5s
CPU (without -Ofast): still running, more than 40 minutes
So I think gcc applies some strange optimizations with the -Ofast flag…

On a much newer system with a faster GPU, I was able to get the GPU test (1024 bodies) down to 10 minutes.

I killed the CPU test on the old system after 2 hours of execution time.

I killed the CPU test on the new system after 1 hour of execution time.

I see no indication that the CPU is faster than the GPU.

Your makefile for the CPU code already specifies -Ofast

I don’t see anything like a 2.5-second execution time using that makefile, with gcc 5.4.0 or gcc 7.3.1.

So I am unable to reproduce your observation.

I see no indication that the CPU build with the -Ofast GCC flag in the CPU makefile is faster than the build without it.

That makes me wonder whether your host code is hitting one of those (in-)famous optimizations that take advantage of undefined behavior, such as signed integer overflow. When it detects undefined behavior, the compiler is justified in eliminating all affected code, which may cause a massive reduction in run time. Which version of gcc is being used here?

Have you checked whether compiling with -Ofast results in output anywhere close to the correct / expected results? A program allowed to deliver incorrect results can be made arbitrarily fast.

You may want to consider cranking up warning sensitivity to identify as much questionable code in the host version as possible, and address all reported items. At minimum, use -Wall (which contrary to its name does not turn on all warnings). In addition or alternatively, consider gradually increasing the optimization level, e.g. -O1, -O2, -O3, -O3 -ffast-math, -Ofast. At what stage does run time drop dramatically?

I’m using gcc 7.3.0 for my host code.
Run time drops dramatically at the -O3 optimization level (0.8s per frame with the 1024-body cluster).
Adding -Wall didn’t produce any warnings.

Generally, I would not expect more than 20% speed-up between -O2 and -O3, but as far as I recall -O3 enables SIMD auto-vectorization and that could provide a much bigger kick in apps dominated by tight computational loops.
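The kind of loop that benefits from auto-vectorization is exactly an n-body inner loop: independent iterations over contiguous float arrays. A minimal example of the pattern (not code from the posted project):

```c
/* Independent per-element work over contiguous arrays: the pattern
 * gcc's auto-vectorizer targets at -O3 (or -O2 -ftree-vectorize).
 * The restrict qualifiers promise x and y don't alias, which the
 * vectorizer needs. Compile with -fopt-info-vec to see whether
 * the loop was actually vectorized. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With AVX this processes 8 floats per instruction, so a 4-8x gain from vectorization alone is plausible in loops like this.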

If I interpret your description above correctly, the working set of your data is very small and should fit entirely in the CPU’s L1/L2 caches, while at the same time limiting the parallelism GPUs can exploit (>10K threads are desirable, but you would be using about a tenth of that).

How big is the speedup factor between -O2 and -O3? Have you checked whether the app returns correct results at -O3?

The speedup factor between -O2 and -O3 is around 7x. No, I haven’t checked the results at -O3. -Ofast is about 25x faster than -O3.

These speedup factors seem improbably high to me.