Binding Error Running NAMD on CUDA Computer

Good morning, I have what might be an asinine question. I am trying to run the molecular dynamics suite NAMD with CUDA enabled (the stock prebuilt version from their server), and I appear to be getting errors from multiple threads trying to share each CUDA device. I’ve spent a few weeks trying to figure this out, read the manual, etc., with no luck, but I have a simulation I’d really like to get started soon, and it seems like a simple error, probably due to some configuration setting. It’s a six-week simulation that could be cut down to two, so pardon my anxiousness to get it started :)

It works fine on an 8-processor, non-CUDA setup.

The computer I am trying to use has 20 processors and 2 CUDA devices per node. Are there any obvious mistakes below?

My run script is:

#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=20:gpus=2
#PBS -l feature='gpgpu:intel14'
#PBS -l mem=3gb

module swap GNU Intel  # school's tech support told me to add this.

nvidia-smi -c 0  # so that each device isn't exclusive.

cd ~/Model
~/NAMDCUDA/namd2 +idlepoll +p20 config.in > test.log

The error:

------------- Processor 7 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error binding to device 0 on pe 7: exclusive-thread device already in use by a different thread

*** glibc detected *** /Model/namd2: double free or corruption (fasttop): 0x00000000022ed3d0 ***
/var/spool/torque/mom_priv/jobs/24520386.mgr-04.i.SC: line 16: 95847 Segmentation fault (core dumped)

NAMD Logfile:

Charm++: standalone mode (not using charmrun)
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (20-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: Built with CUDA version 6000
Did not find +devices i,j,k,… argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 8
Pe 7 physical rank 7 will use CUDA device of pe 8
FATAL ERROR: CUDA error binding to device 0 on pe 7: exclusive-thread device already in use by a different thread
Pe 4 physical rank 4 will use CUDA device of pe 8
FATAL ERROR: CUDA error binding to device 0 on pe 4: exclusive-thread device already in use by a different thread
Program finished after 0.153611 seconds.
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 6 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 1 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 16 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 5 (csn-001 device 0): driver shutting down

Thank you very much for your time in advance.

I believe the essential elements are here:

The command:

nvidia-smi -c 0

is intended to set a GPU’s compute mode to Default (refer to the man page for nvidia-smi, or nvidia-smi --help), which would allow multiple processes (e.g. multiple MPI or Charm++ ranks) to use a single GPU. However, there are at least two issues with this:

  1. This command, in all recent versions of CUDA, requires root privilege.
  2. Multiple ranks sharing a single GPU is really not optimal. If you want to do it, the best approach is to use CUDA MPS, which has a somewhat involved setup (a rough sketch follows this list).
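
For reference, a single-node MPS setup usually looks roughly like the following. Treat it as a sketch only: the pipe/log directories are placeholders, and your cluster may already provide a wrapper for starting MPS inside a job.

# start the MPS control daemon for the GPUs visible to this job
export CUDA_VISIBLE_DEVICES=0,1
export CUDA_MPS_PIPE_DIRECTORY=$HOME/mps/pipe   # placeholder path
export CUDA_MPS_LOG_DIRECTORY=$HOME/mps/log     # placeholder path
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
nvidia-cuda-mps-control -d

# ... launch the CUDA application here; client processes that inherit
# the same CUDA_MPS_PIPE_DIRECTORY will attach to the daemon ...

# shut the daemon down at the end of the job
echo quit | nvidia-cuda-mps-control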

Anyway, the current issue is that your GPU(s) are not in “Default” compute mode; they appear to be in “Exclusive Thread” mode, and your attempt to modify this with nvidia-smi -c 0 doesn’t seem to be working (probably because it requires root; there should be a message to this effect somewhere in your job output).
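
You can confirm the current mode from inside a job, without root, by adding a query before the namd2 launch, for example:

# print the compute mode (and other compute-related settings) of every GPU on the node;
# seeing "Exclusive Thread" here would confirm the diagnosis
nvidia-smi -q -d COMPUTE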

If you’re on a cluster and have access to more than one node, then the right solution is to use multiple nodes, requesting two processes per node and 2 GPUs per node, something like this:

#PBS -l nodes=16:ppn=2:gpus=2

However, I’m not sure whether that should be gpus=2 or gpus=1; your IT group can probably help, or you can figure it out by trial and error.
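
One caveat: if the prebuilt binary you downloaded is the single-node “multicore-CUDA” build, it cannot span nodes; a multi-node run needs one of the networked NAMD builds (e.g. ibverbs-smp-CUDA) launched through charmrun. Purely as a sketch, assuming such a build with charmrun sitting next to namd2 in ~/NAMDCUDA, the job might look something like this (the exact launch flags depend on the build your cluster provides):

#PBS -l walltime=01:00:00
#PBS -l nodes=16:ppn=2:gpus=2

cd ~/Model

# build a Charm++ nodelist from the hosts PBS assigned to this job
echo "group main" > nodelist
sort -u $PBS_NODEFILE | awk '{print "host " $1}' >> nodelist

# 16 nodes x 2 PEs per node = 32 worker processes
~/NAMDCUDA/charmrun ~/NAMDCUDA/namd2 ++nodelist nodelist +p32 +idlepoll config.in > test.log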

If you decide to try this with 20 processes on a single node but only 2 GPUs, then you will need to get the compute mode changed, and even then I would not expect optimal performance.
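
If you do go that route and the GPUs end up in Default mode, it may also be worth passing the +devices argument your log mentions (“Did not find +devices i,j,k,… argument, using all”) so the PEs are split explicitly across both GPUs; assuming the job exposes them as devices 0 and 1, roughly:

# single-node run, explicitly spreading the 20 PEs over both GPUs
# (assumes the job sees them as CUDA devices 0 and 1)
~/NAMDCUDA/namd2 +idlepoll +p20 +devices 0,1 config.in > test.log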

Thank you VERY much for your time. I suspected that the command was not working, as that is what the existing conversations online suggested, and I can now have a more concrete conversation with our cluster IT people.

I just wanted to make sure I didn’t have a misplaced comment or command-line option. Your 2 CPUs + 2 GPUs per node suggestion is, in hindsight, likely the best approach as well.