Binding Error Running NAMD on CUDA Computer

Good morning, I have what might be an asinine question. I am trying to run the molecular dynamics suite NAMD with CUDA enabled (the stock prebuilt version from their server), and I appear to be getting errors from multiple threads trying to share each CUDA device. I’ve spent a few weeks trying to figure this out, read the manual, etc., with no luck, but I have a simulation I’d really like to get started soon, and it seems like a simple error, probably due to some configuration setting. It’s a six-week simulation that could be cut down to two, so pardon my anxiousness to get it started :)

It works fine on an 8-processor, non-CUDA setup.

The computer I am trying to use has 20 processors and 2 CUDA devices per node. Are there any obvious mistakes below?

My run script is:

#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=20:gpus=2
#PBS -l feature='gpgpu:intel14'
#PBS -l mem=3gb

module swap GNU Intel  # school's tech support told me to add this.

nvidia-smi -c 0  # so that each device isn't exclusive.

cd ~/Model
~/NAMDCUDA/namd2 +idlepoll +p20 config.in > test.log

The error:

------------- Processor 7 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error binding to device 0 on pe 7: exclusive-thread device already in use by a different thread

*** glibc detected *** /Model/namd2: double free or corruption (fasttop): 0x00000000022ed3d0 ***
/var/spool/torque/mom_priv/jobs/24520386.mgr-04.i.SC: line 16: 95847 Segmentation fault (core dumped)

NAMD Logfile:

Charm++: standalone mode (not using charmrun)
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (20-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: Built with CUDA version 6000
Did not find +devices i,j,k,… argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 8
Pe 7 physical rank 7 will use CUDA device of pe 8
FATAL ERROR: CUDA error binding to device 0 on pe 7: exclusive-thread device already in use by a different thread
Pe 4 physical rank 4 will use CUDA device of pe 8
FATAL ERROR: CUDA error binding to device 0 on pe 4: exclusive-thread device already in use by a different thread
Program finished after 0.153611 seconds.
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 6 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 1 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 16 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 5 (csn-001 device 0): driver shutting down

Thank you very much for your time in advance.

I believe the essential elements are here:

The command:

nvidia-smi -c 0

is intended to set a GPU’s compute mode to Default (refer to the man page for nvidia-smi, or nvidia-smi --help), which would allow multiple processes (e.g. multiple MPI or Charm++ ranks) to use a single GPU. However, there are at least two issues with this:

  1. This command, in all recent versions of CUDA, requires root privilege.
  2. Multiple ranks sharing a single GPU is really not optimal. If you want to do it, the best approach is to use CUDA MPS, which has a somewhat involved setup (a rough sketch follows this list).
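
For reference, a single-node MPS setup usually looks roughly like the following. Treat it as a sketch only: the pipe/log directories are placeholders, and your cluster may already provide a wrapper for starting MPS inside a job.

# start the MPS control daemon for the GPUs visible to this job
export CUDA_VISIBLE_DEVICES=0,1
export CUDA_MPS_PIPE_DIRECTORY=$HOME/mps/pipe   # placeholder path
export CUDA_MPS_LOG_DIRECTORY=$HOME/mps/log     # placeholder path
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
nvidia-cuda-mps-control -d

# ... launch the CUDA application here; client processes that inherit
# the same CUDA_MPS_PIPE_DIRECTORY will attach to the daemon ...

# shut the daemon down at the end of the job
echo quit | nvidia-cuda-mps-control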

Anyway, the current issue is that your GPU(s) are not in “Default” compute mode; they appear to be in “Exclusive Thread” mode, and your attempt to modify this with nvidia-smi -c 0 doesn’t seem to be working (probably because it requires root; there should be a message to this effect somewhere in your job output).
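
You can confirm the current mode from inside a job, without root, by adding a query before the namd2 launch, for example:

# print the compute mode (and other compute-related settings) of every GPU on the node;
# seeing "Exclusive Thread" here would confirm the diagnosis
nvidia-smi -q -d COMPUTE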

If you’re on a cluster and have access to more than one node, then the right solution is to use multiple nodes, requesting two processes per node and 2 GPUs per node, something like this:

#PBS -l nodes=16:ppn=2:gpus=2

However, I’m not sure whether that should be gpus=2 or gpus=1; your IT group can probably help, or you can figure it out by trial and error.
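
One caveat: if the prebuilt binary you downloaded is the single-node “multicore-CUDA” build, it cannot span nodes; a multi-node run needs one of the networked NAMD builds (e.g. ibverbs-smp-CUDA) launched through charmrun. Purely as a sketch, assuming such a build with charmrun sitting next to namd2 in ~/NAMDCUDA, the job might look something like this (the exact launch flags depend on the build your cluster provides):

#PBS -l walltime=01:00:00
#PBS -l nodes=16:ppn=2:gpus=2

cd ~/Model

# build a Charm++ nodelist from the hosts PBS assigned to this job
echo "group main" > nodelist
sort -u $PBS_NODEFILE | awk '{print "host " $1}' >> nodelist

# 16 nodes x 2 PEs per node = 32 worker processes
~/NAMDCUDA/charmrun ~/NAMDCUDA/namd2 ++nodelist nodelist +p32 +idlepoll config.in > test.log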

If you decide to try this with 20 processes on a single node but only 2 GPUs, then you will need to get the compute mode changed, and even then I would not expect optimal performance.
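
If you do go that route and the GPUs end up in Default mode, it may also be worth passing the +devices argument your log mentions (“Did not find +devices i,j,k,… argument, using all”) so the PEs are split explicitly across both GPUs; assuming the job exposes them as devices 0 and 1, roughly:

# single-node run, explicitly spreading the 20 PEs over both GPUs
# (assumes the job sees them as CUDA devices 0 and 1)
~/NAMDCUDA/namd2 +idlepoll +p20 +devices 0,1 config.in > test.log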

Thank you VERY much for your time. I suspected that the command was not working, as that is what the existing conversations online suggested, and I can now have a more concrete conversation with our cluster IT people.

I just wanted to make sure I didn’t have a misplaced comment or command-line option. Your 2 CPUs + 2 GPUs per node suggestion is, in hindsight, likely the best approach as well.