LAMMPS melt, 20 million atoms: ERROR: Insufficient memory on accelerator (../gpu_extra.h:38)

On a 32GB V100, I successfully ran the 20 million atom test twice, then started getting the insufficient memory error noted in the title.

mpirun -np 1 …/…/src/lmp_mpi -in in.20m.melt -sf gpu -pk gpu 1
LAMMPS (19 Mar 2020)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (839.798 167.96 167.96)
1 by 1 by 1 MPI processor grid
Created 20000000 atoms
create_atoms CPU = 1.49302 secs


- Using acceleration for lj/cut:
-  with 1 proc(s) per device.

Device 0: Tesla V100-PCIE-32GB, 80 CUs, 31/32 GB, 1.4 GHZ (Double Precision)

Initializing Device and compiling on process 0…Done.
Initializing Device 0 on core 0…Done.

ERROR: Insufficient memory on accelerator (…/gpu_extra.h:38)
Last command: run 1000
Cuda driver error 4 in call at file 'geryon/nvd_device.h' in line 135.
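
Since the first two runs worked and later ones did not, one thing I can check is whether the card already has memory in use before LAMMPS starts. Below is a minimal sketch of that check, not a diagnosis. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output, which reports memory in MiB. The helper names `parse_memory_csv` and `query_gpu_memory` are made up for this post.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (MiB) into a list of (used, total) ints."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return (used_MiB, total_MiB) per GPU, or [] if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` right before the `mpirun` line gives one `(used, total)` pair per device. If the used figure is well above zero with nothing running, something else may still be holding memory on the card, but that is only a guess on my part.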

The same 20 million atom test runs without issues using the HIP port on a 32GB MI50:


Initializing Device and compiling on process 0…Done.
Initializing Device 0 on core 0…Done.

Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 3193 | 3193 | 3193 Mbytes
Step Temp E_pair E_mol TotEng Press
0 3 -6.7733681 0 -2.2733683 -3.7027174
50 1.669595 -4.7856647 0 -2.2812723 5.6654026
100 1.6524741 -4.7589429 0 -2.2802319 5.7977251
150 1.6456677 -4.7483396 0 -2.2798381 5.8568397
200 1.6440451 -4.7460612 0 -2.2799936 5.870633
250 1.6434487 -4.745369 0 -2.2801961 5.8751957
300 1.6432804 -4.7455 0 -2.2805795 5.875103
350 1.6431086 -4.745489 0 -2.2808262 5.874611
400 1.6428664 -4.7455253 0 -2.2812258 5.8745923
450 1.6429371 -4.745904 0 -2.2814985 5.8727289
500 1.6426633 -4.7458757 0 -2.2818808 5.8726928
550 1.6425342 -4.7459415 0 -2.2821404 5.8724678
600 1.6423274 -4.7460284 0 -2.2825374 5.8717517
650 1.642208 -4.7460985 0 -2.2827866 5.8712047
700 1.6418347 -4.7459344 0 -2.2831825 5.8719311
750 1.6422137 -4.7467664 0 -2.283446 5.8689695
800 1.6416304 -4.746281 0 -2.2838356 5.8705395
850 1.6413065 -4.7460599 0 -2.2841003 5.8712727
900 1.641055 -4.7460654 0 -2.284483 5.8713051
950 1.6412948 -4.7466834 0 -2.2847413 5.8683462
1000 1.6411186 -4.7468204 0 -2.2851427 5.8675664
Loop time of 470.424 on 1 procs for 1000 steps with 20000000 atoms

Performance: 918.321 tau/day, 2.126 timesteps/s
96.8% CPU use with 1 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 224.09 | 224.09 | 224.09 | 0.0 | 47.64
Neigh | 5.3406e-05 | 5.3406e-05 | 5.3406e-05 | 0.0 | 0.00
Comm | 38.412 | 38.412 | 38.412 | 0.0 | 8.17
Output | 1.3878 | 1.3878 | 1.3878 | 0.0 | 0.30
Modify | 158.78 | 158.78 | 158.78 | 0.0 | 33.75
Other | | 47.75 | | | 10.15

Nlocal: 2e+07 ave 2e+07 max 2e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 1.49849e+06 ave 1.49849e+06 max 1.49849e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds not checked


  Device Time Info (average):

Data Transfer: 49.5213 s.
Data Cast/Pack: 125.1325 s.
Neighbor copy: 0.0002 s.
Neighbor build: 12.3622 s.
Force calc: 47.6733 s.
Device Overhead: 0.2781 s.
Average split: 1.0000.
Threads / atom: 4.
Max Mem / Proc: 28370.52 MB.
CPU Driver_Time: 0.1798 s.
CPU Idle_Time: 86.5202 s.

Please see the log.cite file for references relevant to this simulation

Total wall time: 0:08:01
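
For scale, here is a rough back-of-envelope calculation using only the numbers in the logs above. It is a sketch under two assumptions: that the "MB" and "GB" figures LAMMPS prints are binary units, and that the V100 run would need about as much device memory as the MI50 run did. The second assumption may not hold, since the precision mode of the MI50 run is not shown in the output I pasted.

```python
# Figures copied from the logs in this post.
MAX_MEM_PER_PROC_MB = 28370.52  # "Max Mem / Proc" from the MI50 run
N_ATOMS = 20_000_000            # "Created 20000000 atoms"
V100_FREE_GB = 31               # "31/32 GB" on the V100 device line

# Assumption: MB = 2**20 bytes and GB = 1024 MB.
BYTES_PER_MB = 2**20
MB_PER_GB = 1024


def bytes_per_atom(mem_mb, n_atoms):
    """Average device memory per atom, in bytes."""
    return mem_mb * BYTES_PER_MB / n_atoms


def fraction_of_device(mem_mb, device_gb):
    """Share of the reported device memory that mem_mb would occupy."""
    return mem_mb / (device_gb * MB_PER_GB)


if __name__ == "__main__":
    per_atom = bytes_per_atom(MAX_MEM_PER_PROC_MB, N_ATOMS)
    share = fraction_of_device(MAX_MEM_PER_PROC_MB, V100_FREE_GB)
    print(f"{per_atom:.0f} bytes per atom")
    print(f"{share:.1%} of the memory the V100 reports")
```

Under those assumptions the run uses somewhere around 1.4 to 1.5 kB per atom and close to 90% of the 31 GB the V100 reports, so there is not much headroom on a 32GB card.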

I'm not sure where to go from here. NVIDIA support says the driver is installed successfully. Any help is greatly appreciated, thanks.