Good morning, I have what might be an asinine question. I am trying to run the molecular dynamics suite NAMD with CUDA enabled, using the stock prebuilt binary from their server, and I am getting errors that appear to come from multiple threads trying to share each CUDA device. I’ve spent a few weeks trying to figure this out (reading the manual, etc.) with no luck. I have a simulation I’d really like to start soon, and this looks like a simple error, probably due to some configuration setting. It’s a six-week simulation that could be cut down to two, so pardon my eagerness to get it started :)
It works fine on an 8-processor non-CUDA setup.
The computer I am trying to use has 20 processors and 2 CUDA devices per node. Are there any obvious mistakes below?
My run script is:
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=20:gpus=2
#PBS -l feature='gpgpu:intel14'
#PBS -l mem=3gb
module swap GNU Intel #school’s tech support told me to add this.
nvidia-smi -c 0 #so that each device isn’t exclusive.
cd ~/Model
~/NAMDCUDA/namd2 +idlepoll +p20 config.in > test.log
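Since the error below mentions an exclusive-thread device, one thing worth checking is whether the `nvidia-smi -c 0` line actually changed the compute mode. As far as I understand, changing it normally needs administrator rights, so it may fail silently inside a job. A minimal sketch of such a check, to be placed before the `namd2` line, might look like this. The helper name `check_compute_mode` is hypothetical, and the `--query-gpu` fields are the ones I believe `nvidia-smi` supports:

```shell
# Hypothetical helper: reads "index, compute_mode" lines on stdin and
# reports every GPU whose mode is not "Default" (i.e. not freely shareable).
# Exits non-zero if any such GPU is found.
check_compute_mode() {
    awk -F', ' '
        NF >= 2 && $2 != "Default" { print "GPU " $1 " is in " $2 " mode"; bad = 1 }
        END { exit bad + 0 }
    '
}

# Only query the driver when nvidia-smi is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader \
        | check_compute_mode \
        || echo "at least one GPU is not in Default compute mode" >&2
fi
```

If a GPU is reported here even after `nvidia-smi -c 0`, the mode change did not take effect for this job.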
The error:
------------- Processor 7 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error binding to device 0 on pe 7: exclusive-thread device already in use by a different thread
*** glibc detected *** /Model/namd2: double free or corruption (fasttop): 0x00000000022ed3d0 ***
/var/spool/torque/mom_priv/jobs/24520386.mgr-04.i.SC: line 16: 95847 Segmentation fault (core dumped)
NAMD Logfile:
Charm++: standalone mode (not using charmrun)
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (20-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: Built with CUDA version 6000
Did not find +devices i,j,k,… argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 8
Pe 7 physical rank 7 will use CUDA device of pe 8
FATAL ERROR: CUDA error binding to device 0 on pe 7: exclusive-thread device already in use by a different thread
Pe 4 physical rank 4 will use CUDA device of pe 8
FATAL ERROR: CUDA error binding to device 0 on pe 4: exclusive-thread device already in use by a different thread
Program finished after 0.153611 seconds.
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 6 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 1 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 16 (csn-001 device 0): driver shutting down
FATAL ERROR: CUDA error in cudaGetDeviceProperties on Pe 5 (csn-001 device 0): driver shutting down
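Because the log interleaves output from many PEs, a small filter can help separate the PEs that failed at the binding step from the later "driver shutting down" messages. This is only a sketch: `binding_failures` is a hypothetical helper name, and `test.log` is the log file written by the run script above.

```shell
# Hypothetical helper: reads a NAMD log on stdin and prints one line per
# distinct (device, pe) pair that reported a CUDA binding failure.
binding_failures() {
    sed -n 's/^FATAL ERROR: CUDA error binding to device \([0-9][0-9]*\) on pe \([0-9][0-9]*\):.*/device \1 pe \2/p' \
        | sort -u
}

# Summarize the log from the run above, if it exists in the current directory.
if [ -f test.log ]; then
    binding_failures < test.log
fi
```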
Thank you very much in advance for your time.