Problem running NAMD on Tesla Personal SuperComputer

Hello Everyone,

Previously I tried to run NAMD on a C870 card, but it did not work; it gave the error "Charm++ fatal error:

FATAL ERROR: CUDA error allocating force table: feature is not yet implemented"

Now I am using a Tesla Personal SuperComputer desktop system with four Tesla C1060 cards, and the motherboard has an onboard nForce 780a SLI graphics chip. I am not sure whether it is CUDA-enabled, but my application (NAMD) detects the nForce 780a card, and instead of running only on the four C1060 cards it uses the nForce too. When I tried to run with just one card, it went only to the nForce 780a.

The command I give is like this:

$ charmrun namd2 +p1 …/…/NAMD-with-cuda/NAMD-Data/apoa1/apoa1.namd 2>&1 | tee namd2_apoa1_1P

Running on 1 processors: namd2 …/…/NAMD-with-cuda/NAMD-Data/apoa1/apoa1.namd
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 1 namd2 …/…/NAMD-with-cuda/NAMD-Data/apoa1/apoa1.namd
Charm++> Running on MPI version: 2.1 multi-thread support: MPI_THREAD_SINGLE (max supported: MPI_THREAD_SINGLE)

Did not find +devices i,j,k,… argument, defaulting to (pe + 1) % deviceCount
Pe 0 binding to CUDA device 1 on samir-desktop: ‘nForce 780a SLI’ Mem: 125MB Rev: 1.1
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
[... output snipped ...]
Pe 0 has 144 local and 0 remote patches and 3888 local and 0 remote computes.
allocating 51 MB of memory on GPU
FATAL ERROR: CUDA error malloc everything: out of memory
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error malloc everything: out of memory

[0] Stack Traceback:
[0] CmiAbort+0x2b [0x8600e1]
[1] _Z8NAMD_diePKc+0x56 [0x4ec846]
[2] _Z13cuda_errcheckPKc+0x5e [0x5ef46e]
[3] _Z21cuda_bind_patch_pairsPK10patch_pairiPK10force_listiiii+0x1e4 [0x7d58b4]
[4] _ZN20ComputeNonbondedCUDA6doWorkEv+0x3c2 [0x5f02e2]
[5] _ZN19CkIndex_WorkDistrib30_call_enqueueCUDA_LocalWorkMsgEPvP11WorkDistrib+0xd [0x7a973d]
[6] CkDeliverMessageFree+0x38 [0x80a82f]
[7] _Z15_processHandlerPvP11CkCoreState+0x183 [0x80de9d]
[8] CmiHandleMessage+0x27 [0x8617b6]
[9] CsdScheduleForever+0x64 [0x8632c8]
[10] CsdScheduler+0xd [0x86335f]
[11] _ZN9ScriptTcl3runEPc+0xe1 [0x77f2d1]
[12] _Z18after_backend_initiPPc+0x25d [0x4f0cad]
[13] main+0x24 [0x4f0d94]
[14] __libc_start_main+0xe6 [0x7ffff5ee8466]
[15] namd2 [0x4ebae9]

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.


mpirun has exited due to process rank 0 with PID 8486 on
node samir-desktop exiting without calling “finalize”. This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

And with two threads it detects one Tesla C1060 and the nForce card… and gives the same error.
I want to know whether there is a memory problem with the nForce card, whether it does not support CUDA, or whether something else is wrong.
If anybody has run NAMD on any GPU card, please let me know.

Thanks in advance.

Regards,
Deepti

You can use nvidia-smi to exclude the nForce card.

If you want to use all four C1060 cards, you need to start NAMD with 4 processes:

charmrun namd2 +p4 …/…/NAMD-with-cuda/NAMD-Data/apoa1/apoa1.namd
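If NAMD still picks up the nForce chip, you can also add the +devices option (the one your log says is missing) to pin it to specific GPUs; the device IDs below are only illustrative, since the numbering depends on how CUDA enumerates the cards on your system:

charmrun namd2 +p4 +devices 0,1,2,3 …/…/NAMD-with-cuda/NAMD-Data/apoa1/apoa1.namd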

Thank you sir for your reply.

My problem was solved when I ran with 4 processes and used the +devices option with the namd2 executable, where I mention the IDs of the selected GPUs.

Thanks & Regards,

Deepti

Hello to All,

Has anybody experienced degraded performance of NAMD with CUDA? I have run NAMD on my dual Intel Xeon machine and on the Tesla Personal SuperComputer, using the apoa1 test case. It takes 7m0.616s on the Tesla machine (with 4 processes, one on each device), while on the Intel Xeon machine it takes 4m5.530s with 4 processes.

Please tell me why it gives such performance, as I have already read about the performance gains for NAMD.

Thanks & Regards,
Deepti

Hello Sir,

How can I exclude my nForce card using nvidia-smi? I have gone through its help but could not find any relevant option for doing this. Can you please help me?

Thanks & Regards,

Deepti

nvidia-smi --help

nvidia-smi [OPTION1] [OPTION2 ARG] ...

NVIDIA System Management Interface program for Tesla S870

	-h, --help                                  Show usage and exit
	-x, --xml-format                            Produce XML log (to stdout by default, unless a file is specified with -f or --filename=FILE)
	-l, --loop-continuously                     Probe continuously, clobbers old logfile if not printing to stdout
	-t NUM, --toggle-led=NUM                    Toggle LED state for Unit <NUM>
	-i SEC, --interval=SEC                      Probe once every <SEC> seconds if the -l option is selected (default and minimum: 1 second)
	-f FILE, --filename=FILE                    Specify log file name
	--gpu=GPUID --compute-mode-rules=RULESET    Set rules for compute programs, where GPUID is the number of the GPU
	                                            (starting at zero) in the system and RULESET is one of:
	                                            0: Normal mode
	                                            1: Compute-exclusive mode (only one compute program per GPU allowed)
	                                            2: Compute-prohibited mode (no compute programs may run on this GPU)
	-g GPUID -c RULESET                         (short form of the previous command)
	--gpu=GPUID --show-compute-mode-rules
	-g GPUID -s                                 (short form of the previous command)
	-L, --list-gpus
	-lsa, --list-standalone-gpus-also           Also list standalone GPUs in the system along with their temperatures.
	                                            Can be used with the -l, --loop-continuously option
	-lso, --list-standalone-gpus-only           Only list standalone GPUs in the system along with their temperatures.
	                                            Can be used with the -l, --loop-continuously option
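To find out which ID the nForce chip was assigned, you can list the GPUs first (the -L or -lsa options above); the GPU 0 used below is just an example:

nvidia-smi -L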

So if you want to exclude GPU 0, as root:

nvidia-smi -g 0 -c 2
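You can then check that the rule took effect with the matching show option:

nvidia-smi -g 0 -s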

Regarding your other problem: if you are printing the energies every step, NAMD used to slow down considerably. Try reducing the output frequency and see if that solves your problem. For NAMD-related questions you should post on the NAMD mailing list, BTW.
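For example, in the .namd configuration file something along these lines (the values here are only illustrative) keeps the energies from being printed every step:

outputEnergies 100
outputTiming 100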

Thank you Sir, I had really skipped that option.

Regards,

Deepti