Help with hardware benchmarks for PhD thesis: I am looking for people who can run my code on other GPUs


I am currently writing my PhD thesis. Part of it is the development of LAMMPScuda, a comprehensive USER package for the widely used molecular dynamics code LAMMPS. My code already supports many different material classes (metals, granular materials, coarse-grained systems, semiconductors, biomolecules, polymers, inorganic glasses, and so on) and runs effectively on GPU clusters with several hundred GPUs.

While I already have a lot of benchmarks in my thesis, I would like to add a graph showing the relative performance of the various available GPUs. I already have the GTX280, GTX295, C1060, GTX470, and C2050. I’d like to add all the other CC 1.3 or higher GPUs (GTX260, GTX275, GTX460, GTX570, GTX580) as well. And that’s where I need your help.

What do you need to help me?

- a Linux machine with a CC 1.3 or higher GPU

- g++ and make need to be available (but I guess that’s a given for a reader of the GPU Computing board with a Linux machine)
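As a quick sanity check, you can verify the prerequisites from the shell with a small helper like this (just a sketch, not part of the package; nvcc is needed on top of g++ and make):

```shell
# Hypothetical helper: check whether a tool is on PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Report any missing build prerequisites.
for tool in g++ make nvcc; do
  have "$tool" || echo "missing: $tool"
done
```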

How does it work?

- download the program from

- If you have installed CUDA in a folder other than /usr/local/cuda, you need to modify “trunk/src/USER-CUDA/Makefile.common” accordingly.

- follow this procedure:

EDIT: I have added a file src/USER-CUDA/Examples/PHD-Thesis-Tests/run_test which should do everything automatically.

Just go to src/USER-CUDA/Examples/PHD-Thesis-Tests and run “sh run_test”. Otherwise, follow the steps below.

(i) go to trunk/src/STUBS and type “make”

(ii) go to trunk/src and type: “make yes-KSPACE” and “make yes-USER-CUDA” (in that order)

(iii) go to trunk/src/USER-CUDA and type “make precision=1”, or “make precision=1 arch=20” if you have a Fermi GPU

(iv) go to trunk/src and type “make serial precision=1”, or “make serial precision=1 arch=20” if you have a Fermi GPU

(v) go to trunk/src/USER-CUDA/Examples/PHD-Thesis-Tests and run the tests with

   "../../../lmp_serial < in.melt.cuda > out.melt"

   "../../../lmp_serial < in.silicate-buckingham.cuda > out.silicate"

(vi) go to trunk/src/USER-CUDA/Examples/PHD-Thesis-Tests and report back the output of “grep Loop *” here

(vii) repeat steps (iii) to (vi) with precision=2 instead of precision=1, after deleting all *.o files in src/USER-CUDA and all files in src/Obj_serial
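If you prefer, steps (i) to (iv) can be wrapped in a small shell function like this (just a sketch of the commands above, not part of the package; the run_test script remains the easiest way):

```shell
# Sketch of build steps (i)-(iv) above; not part of LAMMPScuda itself.
# $1 = path to the trunk checkout, $2 = precision (1 or 2),
# $3 = optional "arch=20" for Fermi GPUs (left unquoted so it can be empty).
build_lammpscuda() {
  trunk=$1; prec=$2; arch=$3
  (cd "$trunk/src/STUBS" && make) || return 1
  (cd "$trunk/src" && make yes-KSPACE && make yes-USER-CUDA) || return 1
  (cd "$trunk/src/USER-CUDA" && make precision="$prec" $arch) || return 1
  (cd "$trunk/src" && make serial precision="$prec" $arch) || return 1
}

# Example: build_lammpscuda ~/gpulammps/trunk 1 arch=20
```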

If you have any problems following this procedure let me know.

If you have any other questions feel free to ask.

Thanks for helping


Results so far:

All numbers are loop times in seconds; lower is better.

Device      melt (single)  melt (double)  silicate (single)  silicate (double)
GTX260      26.6           67.9           50.9               136.8
GTX280      23.8           61.2           44.2               118.3
GTX295      23.5           57.1           44.2               113.8
C1060       26.1           61.2           48.0               122.0
GTS450      36.5           79.3           67.4               143.4
GTX470      14.3           27.4           25.8               58.1
GTX480      11.9           22.6           21.4               44.3
C2050       14.7           23.8           26.0               44.6
C2050ECC    17.6           27.4           32.1               53.0
M2050ECC    17.5           27.0           31.8               52.4

CPU results, in seconds (xN indicates how many processes are run, d indicates a dual-processor board, h indicates use of hyperthreading):

CPU              melt   silicate
i7 950 (x4)      157.5  347.8
i7 950 (x8h)     137.5  305.2
X5550 (x4)       165.7  376.1
X5550 (x8d)      82.2   214.4
AMD 6128 (x4)    270.5  552.4
AMD 6128 (x8)    136.0  309.7
AMD 6128 (x16d)  74.0   182.7

Is there a command-line switch to select the device to use? I could provide data points for a C2070 (ECC on; no chance to reboot within the next week since other people are using the CPUs in that box) and a low-end GTS 450.

While there is no command-line switch, LAMMPScuda tries to figure out by itself which device to use. More precisely: it generates a list of CUDA devices and sorts them by multiprocessor count. Then it tries to request devices in that order. If they are not in exclusive mode, it will just request the first device in its list; otherwise it will try them all until it finds a free one.
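To illustrate that ordering with made-up data (this is not the actual implementation, which queries the CUDA API): given “device-ID multiprocessor-count” pairs, the candidates are simply tried in order of decreasing multiprocessor count.

```shell
# Illustration only: read "id multiprocessor-count" lines on stdin and
# print device IDs in the order they would be tried (highest MP count first).
pick_order() {
  sort -k2,2nr | awk '{print $1}'
}

# Example: printf '0 14\n1 30\n2 16\n' | pick_order   -> 1, 2, 0
```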

If one needs to override that behaviour, one can provide a list of devices to use within the input script. While providing multiple GPUs is only useful if LAMMPScuda is compiled with MPI support, this can also be used to specify a single device. Just add “gpu/node special 1 ID” as options to the “accelerator cuda” command, where ID is your desired device ID as reported by deviceQuery from the SDK.



P.S. The GTS450 should be able to run the test as well. So it would be nice if you ran it twice, one time with:

“accelerator cuda gpu/node special 1 0”

and the second time with

“accelerator cuda gpu/node special 1 1”

Just replace the old “accelerator cuda” lines in the two in.* files.
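Since both files only need that one line changed, something like the following should do it (my sketch, assuming GNU sed and that the line starts with “accelerator cuda”; adjust if your files differ):

```shell
# Hypothetical helper: rewrite the "accelerator cuda" line in the given
# input files so the run is pinned to one device ID.
set_device() {
  id=$1; shift
  for f in "$@"; do
    sed -i "s|^accelerator cuda.*|accelerator cuda gpu/node special 1 $id|" "$f"
  done
}

# First run on device 0, then on device 1:
# set_device 0 in.melt.cuda in.silicate-buckingham.cuda
# set_device 1 in.melt.cuda in.silicate-buckingham.cuda
```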

P.P.S. It would be nice if you could include information about the clock speed on consumer cards, since there are a lot of non-reference-design devices out there.

Ah, as you might have noticed, I misunderstood your intentions: you wanted to know right away how to use the GTS450. You’ll find the answer in the P.S. of my previous post.


The build system in the svn trunk I just pulled is broken. Trying to run step 2 of your instructions gives me this:

~/build/gpulammps-read-only/src$ make yes-USER-CUDA

Installing package USER-CUDA

[: 420: 1: unexpected operator

[: 420: 1: unexpected operator

Further to that, your Makefiles in USER-CUDA needed some modification before I could get the library to build. After I fixed that, I couldn’t get the next steps to build, failing with this:

pair_morse_coul_long.cpp: In member function ‘virtual void* LAMMPS_NS::PairMorseCoulLong::extract(char*, int&)’:

pair_morse_coul_long.cpp:615: error: ‘strcmp’ was not declared in this scope

make[1]: *** [pair_morse_coul_long.o] Error 1

make[1]: Leaving directory `/home/david/build/gpulammps-read-only/src/Obj_serial'

make: *** [serial] Error 2

so I gave up…

OK, I fixed the issue with the missing reference. Do a “make no-all” before updating from the svn with “svn update” in the src folder.

In order to install the CUDA package, it should also be possible to go to src/USER-CUDA and do “./ 1”.


This was on a Ubuntu 9.04 64 bit machine (so gcc 4.3.3) with CUDA 3.2.

OK, sorry to all, I forgot one more thing for repeating the test in double precision. You need to delete the *.o files in the USER-CUDA folder and all files in src/Obj_serial/ before recompiling.
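In shell terms, that cleanup amounts to something like this (sketch; pass the path to your checkout):

```shell
# Remove stale object files so the precision=2 rebuild starts clean.
# $1 = path to the trunk checkout.
clean_objects() {
  rm -f "$1"/src/USER-CUDA/*.o
  rm -f "$1"/src/Obj_serial/*
}

# Example: clean_objects ~/gpulammps/trunk
```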

Sorry again

I see we had some problems with Ubuntu earlier already, which were fixed at some point, though. Thanks for the info. I have a friend who might be able to reproduce the problem.


On a stock GTX 470 (607 MHz graphics clock, 1674 MHz memory clock, 1215 MHz processor clock):


build/gpulammps-read-only/src/USER-CUDA/Examples/PHD-Thesis-Tests/out.melt:Loop time of 681.948 on 1 procs for 2000 steps with 256000 atoms

in.silicate-buckingham.cuda didn’t run for the precision=1 case. I don’t know whether that is expected or not.

precision=2 running now…

I think the USER-CUDA package didn’t install correctly. The time is roughly what a CPU core needs. If the USER-CUDA package is installed, there should be files with *_cuda.cpp and *_cuda.h in the src folder. Maybe you could try “./ 1” in the USER-CUDA folder and compile again?
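A quick way to check this from the shell (my sketch; the file name used in the example below is just illustrative):

```shell
# Hypothetical check: returns 0 if any *_cuda.cpp files are present in
# $1/src, i.e. the USER-CUDA package was installed into the source tree.
usercuda_installed() {
  ls "$1"/src/*_cuda.cpp >/dev/null 2>&1
}

# Example: usercuda_installed ~/gpulammps/trunk && echo "package installed"
```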

I have now added a “run_test” script in src/USER-CUDA/Examples/PHD-Thesis-Tests/ which should do everything automatically (at least it worked for me on two different machines with a fresh download).

Btw. I really really appreciate your help.



I don’t think I can waste any more time messing around with this, sorry. Perhaps someone more adroit than I can get you some result.

Sure, no problem. I appreciate you giving it a try.

I added results of the GPUs I have access to in the first post.


I cannot find lmp_serial, only lmp_mpi_mpd_cpu and lmp_mpi_mpd_cuda, in the directory gpulammps-read-only/src/USER-CUDA/Examples

After compiling, lmp_serial should exist in gpulammps-read-only/src

(that’s why there is ../../../ before lmp_serial).



Added CPU results in first post.

Using device 0: GeForce GTX 260

Using device 0: GeForce GTX 260

Using device 0: GeForce GTX 260

Using device 0: GeForce GTX 260

Binary file lmp_serial-d matches

Binary file lmp_serial-s matches

log.lammps:Loop time of 136.785 on 1 procs for 2000 steps with 11664 atoms

out.melt-d:Loop time of 67.9529 on 1 procs for 2000 steps with 256000 atoms

out.melt-s:Loop time of 26.5719 on 1 procs for 2000 steps with 256000 atoms

out.silicate-d:Loop time of 136.785 on 1 procs for 2000 steps with 11664 atoms

out.silicate-s:Loop time of 50.9264 on 1 procs for 2000 steps with 11664 atoms

run_test:grep Loop *

Binary file lmp_serial-d matches

Binary file lmp_serial-s matches

log.lammps:Pair time (%) = 90.7013 (66.3096)

out.melt-d:Pair time (%) = 47.8504 (70.417)

out.melt-s:Pair time (%) = 16.2782 (61.261)

out.silicate-d:Pair time (%) = 90.7013 (66.3096)

out.silicate-s:Pair time (%) = 24.4041 (47.9203)

run_test:grep ‘Pair time’ *

Binary file lmp_serial-d matches

Binary file lmp_serial-s matches

log.lammps:Neigh time (%) = 4.46387 (3.26343)

out.melt-d:Neigh time (%) = 16.3246 (24.0233)

out.melt-s:Neigh time (%) = 7.2138 (27.1483)

out.silicate-d:Neigh time (%) = 4.46387 (3.26343)

out.silicate-s:Neigh time (%) = 1.77486 (3.48514)

run_test:grep ‘Neigh time’ *

Sorry, it doesn’t compile for me. It seems to expect that CUDA is installed at /usr/local/cuda (i.e. /usr/local/cuda/bin/nvcc). On my machines it’s installed elsewhere. I unfortunately don’t have time to figure out how your build system works.

Also, I am getting the “unexpected operator” errors avidday reported.


The install path can be changed in line 17 of src/USER-CUDA/Makefile.common. We were also able to reproduce the “unexpected operator” behaviour which avidday reported. In a virtual Ubuntu box we found the same thing, and tracked it back to the shell not recognizing ‘==’ as a comparison operator; it expects a single ‘=’. That is now changed in the repository.
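For reference, POSIX test only defines the single ‘=’ form for string comparison, which is why stricter shells (dash, for example, which /bin/sh points to on Ubuntu) reject ‘==’ with exactly that “unexpected operator” message:

```shell
# POSIX-portable string comparison: use '=' inside [ ]; '==' is a bash
# extension that shells like dash reject with "unexpected operator".
x=foo
if [ "$x" = "foo" ]; then
  echo match
fi
```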

Thanks to avidday for identifying the problem btw.

There have been some other compiler problems in that virtual box, though. We had to specifically fall back to GCC 4.3, but that’s probably due to the CUDA version.