Hi there,
I have the LAMMPS 7Aug19 version installed with CUDA 11.0. My system has 4 RTX 2080 Ti GPUs, and I am now testing it. The job starts normally but stops after some time.
Here is what I got from log.lammps.
This is the start of the test. It ran well and produced output normally:
LAMMPS (7 Aug 2019)
Reading data file …
orthogonal box = (0.463237 0.311 -200.333) to (299.714 299.472 549.035)
2 by 1 by 5 MPI processor grid
reading atoms …
3366822 atoms
read_data CPU = 3.18743 secs
23415 atoms in group top
23415 atoms in group bottom
Using acceleration for eam/alloy:
with 3 proc(s) per device.
Device 0: GeForce RTX 2080 Ti, 68 CUs, 10/11 GB, 1.5 GHZ (Mixed Precision)
Device 1: GeForce RTX 2080 Ti, 68 CUs, 1.5 GHZ (Mixed Precision)
Device 2: GeForce RTX 2080 Ti, 68 CUs, 1.5 GHZ (Mixed Precision)
Device 3: GeForce RTX 2080 Ti, 68 CUs, 1.5 GHZ (Mixed Precision)
Initializing Device and compiling on process 0…Done.
Initializing Devices 0-3 on core 0…Done.
Initializing Devices 0-3 on core 1…Done.
Initializing Devices 0-3 on core 2…Done.
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 213.8 | 222.5 | 231 Mbytes
Step Press Pxx Pyy Pzz PotEng KinEng Temp
0 -2359.4445 -2201.5012 -2271.2332 -2605.5992 -17189921 43519.577 100
1000 -40.779454 503.96003 161.00213 -787.30053 -17173038 26638.952 61.211421
2000 -822.42416 -557.51214 -301.46479 -1608.2955 -17173336 26937.646 61.897765
3000 -839.66882 -337.19188 -522.98215 -1658.8324 -17173366 26967.532 61.966439
…
After about 15 minutes, I got the following error:
16000 616.47407 385.44244 439.3098 1024.67 -17173447 27047.796 62.150872
17000 404.48152 132.87159 213.72393 866.84904 -17173422 27022.852 62.093554
18000 -40.005683 -378.79838 -265.44668 524.22801 -17173410 27011.647 62.067808
19000 -79.560549 -435.62465 -325.25229 522.1953 -17173401 27001.794 62.045166
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] 9 more processes have sent help message help-mpi-api.txt / mpi-abort
[localhost.localdomain:22867] Set MCA parameter “orte_base_help_aggregate” to 0 to see all help / error messages
In addition, the connection to all GPUs is lost after the crash, and I need to reboot the system to get them back.
My system spec:
OS: CentOS 7
NVIDIA driver version: 450.66
CUDA version: 11.0
CPU: Intel(R) Core™ i9-9820X CPU @ 3.30GHz
I think I had the same problem before on this machine when it was running Ubuntu 18.04, also with LAMMPS 7Aug19. At that time the CUDA version was 10.2 and the driver version was 440.xx.
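In case it helps with comparing runs, here is a minimal sketch (my own helper, not part of LAMMPS) that summarizes a log like the one above: it reports the last thermo step printed before the crash and counts the "Cuda driver error" lines by error code, file, and line. The names `is_thermo_row` and `summarize` are hypothetical, and the script assumes that thermo rows start with an integer step followed only by numeric columns, as in my output.

```python
import re
from collections import Counter

# Matches lines such as:
#   Cuda driver error 600 in call at file 'geryon/nvd_timer.h' in line 99.
# The single "." on each side of the file name accepts any quote character.
CUDA_ERR = re.compile(r"Cuda driver error (\d+) in call at file .(.+?). in line (\d+)")


def is_thermo_row(line):
    """Assumption: a thermo row is an integer step followed by numeric columns."""
    tokens = line.split()
    if len(tokens) < 2 or not tokens[0].isdigit():
        return False
    try:
        for token in tokens[1:]:
            float(token)
    except ValueError:
        return False
    return True


def summarize(log_text):
    """Return (last thermo step seen, Counter of (code, file, line) errors)."""
    errors = Counter()
    last_step = None
    for line in log_text.splitlines():
        match = CUDA_ERR.search(line)
        if match:
            key = (int(match.group(1)), match.group(2), int(match.group(3)))
            errors[key] += 1
        elif is_thermo_row(line):
            last_step = int(line.split()[0])
    return last_step, errors


# Small excerpt of the output above, used only to exercise the function.
sample = """\
18000 -40.005683 -378.79838 -265.44668 524.22801 -17173410 27011.647 62.067808
19000 -79.560549 -435.62465 -325.25229 522.1953 -17173401 27001.794 62.045166
Cuda driver error 600 in call at file 'geryon/nvd_timer.h' in line 99.
Cuda driver error 600 in call at file 'geryon/nvd_timer.h' in line 99.
"""

last_step, errors = summarize(sample)
print("last thermo step:", last_step)
for (code, source_file, line_no), count in errors.items():
    print(f"error {code} at {source_file}:{line_no} seen {count} time(s)")
```

To run it on a real log, the text can be read with `open("log.lammps").read()` and passed to `summarize`. Note that the MPI and CUDA error messages may go to the terminal rather than to log.lammps, so the captured console output is probably the better input.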