My NAMD CUDA experience thus far: GTX 260 (192 SP)

Hello all,

I am really hoping to sell my boss on the idea of CUDA. I have a ridiculously stubborn research adviser who would rather spend thousands on racks of old dual-socket Barcelona U2 than take a chance on CUDA (even outside of our cluster :shrug:).

So I decided to venture out a bit on my own. I was fortunate enough to attend a ‘Many Cores’ seminar at the Ohio Supercomputer Center last year, and my CUDA programming is slowly developing, but the only place my boss would care to see improvement is on our NAMD bio system jobs.

My desktop rig is not that hot for crunching numbers. I would prefer some more cores (maybe I’ll toss in a q9650 :shrug:)

E8600 @ 3.96 GHz (this is a dual-core Wolfdale with 6 MB of L2)

GTX 260 (192 SP cards) @ stock 576 MHz core

CentOS 5.5

So I tossed on NAMD 2.7 as well as the NAMD 2.7 CUDA version and am running two separate systems.

System 1: Water box (50 x 50 x 50) in which I am growing a micelle from perfluorooctanoate. The system is neutralized with sodium. Please ignore the flip-flopping caused by the periodic boundaries. The box is too small for this one, but it was a nice test NPT simulation. (This was my undergrad senior project, I just graduated :) )

Here I am just representing the surfactant resi’s

It’s fun to watch them ‘bud’ from the water voids in the .dcd’s

System 2: I have a lipid bilayer that I am playing with. This system has many more atoms, so I was hoping to see some better CUDA benefits to show the boss.



My wall times were cut in HALF for my micelle system. Considering that all of the bonded interactions still run only on my dual core (albeit @ 3.9 GHz), I was very happy to see this. I am still waiting for my CPU-only run of system 2. Extrapolating from past data, it looks like it may be about a 130% speedup.
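A minimal Python sketch of how two wall times translate into a speedup figure. The wall times below are hypothetical placeholders, not measurements from these runs, and the function names are made up for illustration. The only point is that "cut in half" corresponds to a 2.0x factor, which is a 100% speedup under this definition.

```python
def speedup_factor(cpu_wall_s: float, gpu_wall_s: float) -> float:
    """How many times faster the CUDA run finished than the CPU-only run."""
    if cpu_wall_s <= 0 or gpu_wall_s <= 0:
        raise ValueError("wall times must be positive")
    return cpu_wall_s / gpu_wall_s


def percent_speedup(cpu_wall_s: float, gpu_wall_s: float) -> float:
    """Speedup expressed as a percentage gain over the CPU-only run."""
    return (speedup_factor(cpu_wall_s, gpu_wall_s) - 1.0) * 100.0


# Hypothetical wall times in seconds, for illustration only.
cpu_only = 7200.0
with_cuda = 3600.0
print(f"{speedup_factor(cpu_only, with_cuda):.2f}x")
print(f"{percent_speedup(cpu_only, with_cuda):.0f}% speedup")
```

Under this definition a 130% speedup would correspond to a 2.3x factor, so it helps to state which convention a quoted percentage uses.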

Conclusions/Remarks: I am really trying to sell my boss on the idea of putting GTX 470s in a couple of the cluster machines, or in some of the desktops, for submitting these types of jobs to. I was hoping to get some ammunition from you guys. Is there anything obvious that I am missing?

I noticed that only one of my two GTX 260s is being used for the non-bonded calculations, and it barely heats up at all. Is there any way I can offload more work onto the GPU?

PS. I submit with the following command lines:

I cannot seem to find any options that would allow me to tweak how my card is being used here. I have grown to hate the NAMD manuals…

Thank you for your time,


I’d be happy to run a NAMD benchmark on my GTX 470 at work if you can give me instructions on what to type. (Use small words: I’m a particle physicist, so I don’t know what molecules are. :) )

Edit: Also, does NAMD make efficient use of multiple GPUs? The system with the GTX 470 also has three GTX 295 cards in it (7 GPUs!), so if you want a massively multi-GPU measurement, I can do that too.

NAMD certainly makes use of multiple GPUs.

I am also trying to bench NAMD 2.7b CUDA with MPICH2 in comparison to NAMD 2.7b MPICH2. So far I have a 3x boost. When I have more results I might as well post them here.

The command I am using is

./charmrun +p4 ./namd2 +idlepoll +devices 0,0,0,0 run.txt

on an Intel® Core™2 Quad CPU @ 2.40GHz with a Tesla C1060.

Copying from the notes.txt

Should I not have SLI enabled? NAMD is telling me that it sees only one device. :shrug:

AWESOME! Thank you… that makes much more sense.

I would really like to see the speedup from a GTX 470! How many CPUs are you using to be able to utilize all of those GPUs?

I guess I am confused as to how I could assign GPU #2 to processes as well (in my case). With my dual core, I only have the ‘two’ NAMD processes. I guess the best I could do is the following?

Thank you for all the responses guys! I have always loved the NVIDIA forums. I think this would be a neat place to post some NAMD CUDA renderings and speedup times.


What CUDA driver are you using? SLI used to hide devices from CUDA, but I thought that was fixed a year ago. (SLI provides no benefit to CUDA, so unless you need it for OpenGL, you should turn it off.)

It is a quad-core 2.66 GHz Intel Core i7 processor with hyperthreading turned on for 8 “virtual” cores. I find that in this current generation, hyperthreading enabled with 8 processes running gives me 50% more throughput on my jobs than 4 processes. (Edit: that’s for CPU-only jobs. Most of the GPU jobs this computer runs are pretty light on the CPU side, so the number of CPU cores is less critical.)

OK, the 64-bit CUDA version seems to be working on my GPU node here, but I have no simulation configuration files to run. Can you point me at the files corresponding to the speed tests you posted above?

If I am not mistaken, you can have more than 2 processes on a dual core machine.
For example, you can have 4 processes at 50% each, instead of 2 at 100% each.
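To illustrate the idea, here is a small Python sketch of how four processes could be spread over two GPUs. It assumes that the device IDs listed after `+devices` are handed out to processes cyclically (round-robin); that rule is an assumption here, and `assign_devices` is a hypothetical helper, not part of NAMD.

```python
def assign_devices(num_processes, devices):
    """Map each process rank to a GPU ID, cycling through the device list.

    Assumption: devices are assigned round-robin in rank order.
    """
    if num_processes < 1:
        raise ValueError("need at least one process")
    if not devices:
        raise ValueError("need at least one device ID")
    return {rank: devices[rank % len(devices)] for rank in range(num_processes)}


# Four processes on a dual core, sharing two GPUs (IDs 0 and 1).
for rank, gpu in assign_devices(4, [0, 1]).items():
    print(f"process {rank} -> GPU {gpu}")
```

With this mapping, processes 0 and 2 would share GPU 0 and processes 1 and 3 would share GPU 1, so both cards get work even though there are only two CPU cores.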


Unfortunately I cannot. I’m running a TraPPE UA forcefield for my surfactant that I made, and I do not think I would be allowed to release it. :(

For CPU only, on some 920’s, I was seeing a ~12% speedup from HTT.

OK, do you have a suggested generic configuration file to benchmark?

I’m going to throw together a simple liquid simulation. Can I email the files to you?

When you guys run NAMD with CUDA, make sure the outputEnergies config parameter is a large number, as any timestep that outputs energies currently falls back to the host. If you do it too often, you will slow down the GPU (or rather the GPU will be idle for more timesteps than it ought to…)
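A rough Python sketch of why a large outputEnergies value matters, assuming energies are printed once every outputEnergies timesteps and that each such step is one that falls back to the host. The run length and intervals below are hypothetical, and the function names are made up for illustration.

```python
def energy_output_steps(total_steps: int, output_energies: int) -> int:
    """Count timesteps in 1..total_steps on which energies are printed."""
    if total_steps < 1 or output_energies < 1:
        raise ValueError("both arguments must be positive integers")
    return total_steps // output_energies


def host_fallback_fraction(total_steps: int, output_energies: int) -> float:
    """Fraction of timesteps that would fall back to the host."""
    return energy_output_steps(total_steps, output_energies) / total_steps


# Hypothetical 1,000,000-step run with three different output intervals.
for interval in (1, 100, 10_000):
    frac = host_fallback_fraction(1_000_000, interval)
    print(f"outputEnergies {interval:>6}: {frac:.4%} of steps on the host")
```

With an interval of 1, every step falls back to the host; at 10,000 it is only one step in ten thousand.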


Hi tachyon John,

I output energies and pressures every 10k to 20k steps for my NPT simulations. I don’t really need to keep an eye on them at that point.

Speaking of Tachyon, is there a way to perform the Tachyon ray trace / rendering for VMD using the GPU instead of my CPU? Some of our renderings take as much as an hour :shrug: (using VMD 1.8.7 with the following command)


I haven’t had a chance to start working on adapting Tachyon for CUDA/OptiX, but it’s on my TODO list, believe me…


John Stone