GPU-accelerated LAMMPS runs for a while, then stops with Cuda driver error 600

Hi there,

I have LAMMPS (7Aug19 version) installed with CUDA 11.0. My system has 4 RTX 2080 Ti GPUs, and I am now testing it. The job starts normally but stops after some time.

Here’s what I got from log.lammps.

This is the start of the test. It ran well and produced output normally:
LAMMPS (7 Aug 2019)
Reading data file …
orthogonal box = (0.463237 0.311 -200.333) to (299.714 299.472 549.035)
2 by 1 by 5 MPI processor grid
reading atoms …
3366822 atoms
read_data CPU = 3.18743 secs
23415 atoms in group top
23415 atoms in group bottom

Using acceleration for eam/alloy:
with 3 proc(s) per device.

Device 0: GeForce RTX 2080 Ti, 68 CUs, 10/11 GB, 1.5 GHZ (Mixed Precision)
Device 1: GeForce RTX 2080 Ti, 68 CUs, 1.5 GHZ (Mixed Precision)
Device 2: GeForce RTX 2080 Ti, 68 CUs, 1.5 GHZ (Mixed Precision)
Device 3: GeForce RTX 2080 Ti, 68 CUs, 1.5 GHZ (Mixed Precision)

Initializing Device and compiling on process 0…Done.
Initializing Devices 0-3 on core 0…Done.
Initializing Devices 0-3 on core 1…Done.
Initializing Devices 0-3 on core 2…Done.

Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 213.8 | 222.5 | 231 Mbytes
Step Press Pxx Pyy Pzz PotEng KinEng Temp
0 -2359.4445 -2201.5012 -2271.2332 -2605.5992 -17189921 43519.577 100
1000 -40.779454 503.96003 161.00213 -787.30053 -17173038 26638.952 61.211421
2000 -822.42416 -557.51214 -301.46479 -1608.2955 -17173336 26937.646 61.897765
3000 -839.66882 -337.19188 -522.98215 -1658.8324 -17173366 26967.532 61.966439

After about 15 minutes, I got the following error:

16000 616.47407 385.44244 439.3098 1024.67 -17173447 27047.796 62.150872
17000 404.48152 132.87159 213.72393 866.84904 -17173422 27022.852 62.093554
18000 -40.005683 -378.79838 -265.44668 524.22801 -17173410 27011.647 62.067808
19000 -79.560549 -435.62465 -325.25229 522.1953 -17173401 27001.794 62.045166
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.
Cuda driver error 600 in call at file ‘geryon/nvd_timer.h’ in line 99.

MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[localhost.localdomain:22867] 9 more processes have sent help message help-mpi-api.txt / mpi-abort
[localhost.localdomain:22867] Set MCA parameter “orte_base_help_aggregate” to 0 to see all help / error messages

Besides, the connection to all GPUs is lost and I need to reboot the system to get them back.

My system spec:

OS: Centos7
NVIDIA DRIVER version: 450.66
CUDA version: 11.0
CPU: Intel® Core™ i9-9820X CPU @ 3.30GHz

I think I had this problem before on the same machine when it was running Ubuntu 18.04. LAMMPS was also 7Aug19. The CUDA version at that time was 10.2 and the driver version was 440.xx.

A small update:

I ran the same simulation but with 4 CPU cores and 2 GPUs. It’s been an hour and everything is still in good shape.

In the simulation that stopped after 15 minutes (the case I showed above), I used 10 CPU cores and 4 GPUs.

If the system works with 2 GPUs but fails with 4 GPUs, but only after some time, you could have a thermal or power supply issue.

You can check the temperature of the GPUs with nvidia-smi.
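
As a rough sketch of how that check could be scripted, the following Python reads temperature and power draw through nvidia-smi's CSV query interface. The helper names and the 85 °C threshold are assumptions made for illustration and are not values taken from this thread.

```python
import subprocess

# Fields requested from `nvidia-smi --query-gpu`.
QUERY_FIELDS = "index,temperature.gpu,power.draw"


def _to_float(text):
    """Return a float, or None when nvidia-smi reports e.g. '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, temp_c, power_w = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "temp_c": _to_float(temp_c),
            "power_w": _to_float(power_w),
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


def hot_gpus(stats, limit_c=85.0):
    """Return the readings at or above limit_c (threshold is an assumption)."""
    return [s for s in stats
            if s["temp_c"] is not None and s["temp_c"] >= limit_c]
```

Calling `read_gpu_stats()` every few seconds while the LAMMPS job runs would give a time series to look at after a crash. Note that a polling interval of seconds cannot resolve millisecond-scale power spikes.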

How much system memory does the system have? I will assume 128 GB for a minimal system based on the number of GPUs.

4xRTX 2080Ti        1000W
i9-9820X             165W
128 GB DDR4           50W
motherboard, storage  20W
=========================
total               1235W

Note that CPUs and GPUs use a continuous power rating like TDP (thermal design power) for the nominal wattage, but that short term power spikes (on the order of milliseconds) can significantly exceed this. For a rock solid system, the total nominal power of all system components should therefore not significantly exceed 60% of the nominal wattage of the PSU (power supply unit). So in this case you would need a 2000W PSU (80PLUS Platinum rated would be ideal, 80PLUS Gold rated would be mainstream).
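
The arithmetic behind that recommendation can be written out as a short Python sketch. The component wattages are the nominal figures from the table above; the function names are made up for illustration, and the 60% load fraction is the rule of thumb stated in this post, not a vendor specification.

```python
# Nominal (TDP-style) wattages from the table above.
NOMINAL_WATTS = {
    "4x RTX 2080 Ti": 1000,
    "i9-9820X": 165,
    "128 GB DDR4": 50,
    "motherboard, storage": 20,
}


def total_nominal_watts(components):
    """Sum the nominal wattage of all components."""
    return sum(components.values())


def min_psu_watts(total_watts, max_load_fraction=0.60):
    """Smallest PSU rating that keeps the nominal load at or below the fraction."""
    return total_watts / max_load_fraction


total = total_nominal_watts(NOMINAL_WATTS)
print(total)                        # 1235
print(round(min_psu_watts(total)))  # 2058
```

The result of roughly 2058W is where the 2000W figure comes from.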

What is the wattage of the power supply in this system?

Thank you for your reply.

The temperature looks fine when I’m using all GPUs. Typically when I’m using all GPUs, the burden on each one gets lighter and the temperature is lower than the case of just using 2 GPUs.

My machine has one i9-9820X installed (TDP = 165W) and 128 GB of RAM (4 x 32 GB Corsair memory). The power supply is 1600W (Corsair AX1600i). The RTX 2080 Ti cards are the ASUS Turbo version. The tower case is a Corsair Air 540.
Here’s the link to this GPU: https://www.amazon.com/GeForce-Turbo-Type-C-graphics-TURBO-RTX2080TI-11G/dp/B07GK2LWDL/ref=sr_1_1?dchild=1&keywords=rtx+2080ti+asus+turbo&qid=1600196491&sr=8-1

The setup is quite close to the following link:
https://l7.curtisnorthcutt.com/the-best-4-gpu-deep-learning-rig

Do you have any idea what the error code means?

I cannot find the wattage for the ASUS GeForce RTX 2080 Ti 11G Turbo Edition, but based on the information I can find this appears to be a mildly vendor-overclocked card which likely draws more power than a baseline RTX 2080 Ti (250W). So my power computation above was likely a tad too low.

That is cutting it too close. You should be able to run stably with three of your GPUs with that, but four is too much. Usually what happens is one of two things when a sudden upswing in power draw occurs on a heavily loaded system, and the PSU cannot keep up:

(1) System-wide voltage drop occurs. The system senses this and applies a spontaneous hard reboot.

(2) Voltage drop occurs in one or several of the GPUs, which screws up the timing including at the PCIe interface. The driver finds that the “GPU fell off the bus”, i.e. it can no longer communicate with the GPU. If you check your system logs, you will likely find just such a message.

I could dig around the internet for the error code (I am not even sure where it is reported), but so could you. Your system configuration and the description of the observed symptoms are entirely consistent with my diagnosis of an underpowered (in the electrical sense) system.

Thank you for your help!

I now have the system log file (the part of the whole log that indicates ‘GPU fell off the bus’). It is attached to this email as ‘crash_report’.

I also created an Nvidia bug report and you can find it in the attachment. Maybe we can find some clue here?

nvidia-bug-report.log.gz (3.02 MB)

(Attachment crash_report is missing)

Here is your problem:

Sep 14 12:48:10 localhost kernel: NVRM: Xid (PCI:0000:68:00): 79, pid=22881, GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: GPU 0000:68:00.0: GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: Xid (PCI:0000:19:00): 79, pid=22881, GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: GPU 0000:19:00.0: GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=22881, GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: Xid (PCI:0000:67:00): 79, pid=22881, GPU has fallen off the bus.
Sep 14 12:48:10 localhost kernel: NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: Xid (PCI:0000:68:00): 79, pid=32740, GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: GPU 0000:68:00.0: GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: Xid (PCI:0000:19:00): 79, pid=32740, GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: GPU 0000:19:00.0: GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=32740, GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: Xid (PCI:0000:67:00): 79, pid=32740, GPU has fallen off the bus.
Sep 15 12:12:30 localhost kernel: NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.

Root cause (with 95% certainty): Insufficient power supply to the GPU due to under-dimensioned PSU.

Possible solutions:

(1) Replace the current power supply unit with a 2000W PSU (this may be difficult if you are in a country with a 120V electrical system). PSUs sold in the US are typically limited to 1600W since most common outlets are configured to supply 15A of current. Some 20A outlets may be available in newer buildings, but you need a different socket/plug.
(2) Restrict software usage to no more than three RTX 2080 Tis at any given time while running LAMMPS (try CUDA_VISIBLE_DEVICES).
(3) Replace the four RTX 2080 Tis with four GPUs drawing no more than 190W each.
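
For option (2), a minimal Python sketch of restricting a run to three GPUs might look like the following. The executable name `lmp`, the input file name `in.test`, and the rank count are placeholders, not details from this thread; `-sf gpu` and `-pk gpu N` are the LAMMPS command-line switches for the GPU package, and CUDA_VISIBLE_DEVICES is the standard CUDA environment variable.

```python
import os


def restricted_env(device_ids, base_env=None):
    """Copy the environment and expose only the listed GPU indices to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def lammps_command(n_ranks, n_gpus, input_file, exe="lmp"):
    """Build an mpirun command line; exe and input_file are placeholders."""
    return ["mpirun", "-np", str(n_ranks), exe,
            "-sf", "gpu", "-pk", "gpu", str(n_gpus), "-in", input_file]


env = restricted_env([0, 1, 2])
cmd = lammps_command(n_ranks=9, n_gpus=3, input_file="in.test")
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1,2
print(" ".join(cmd))
# To launch for real: subprocess.run(cmd, env=env, check=True)
```

Which three indices to keep is arbitrary here. The point is only that the fourth card stays idle, so its load never adds to the total power draw.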

Although I do not recommend this, you could try a higher-efficiency 1600W power supply. PSUs with the highest 80PLUS rating (80PLUS Titanium) often are designed with higher-quality components and greater engineering margins which just might provide enough reserves to tide over power spikes. But this relies on luck, not a good idea IMHO if you are dependent on a stable system that can be put to work 24/7. You also would be looking at an expenditure of US$ 500+ for such a PSU, while success is not guaranteed at all.

Hi there.

I got a 2000W PSU and I still have the same bug. Could you take a look at the bug report file? I used a 110V to 220V converter to draw enough power from the PSU. The converter is working properly.

Thank you,

nvidia-bug-report.log.gz (2.5 MB)

I used a 110V to 220V converter to draw enough power from the PSU. The converter is working properly

Converter: Yikes! Please don’t kill yourself or burn down the building. How many amps @ 110V is this contraption drawing when the machine is under full load?
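
A back-of-the-envelope estimate for that question, as a Python sketch: wall current is the DC-side load divided by PSU efficiency and line voltage. The 92% efficiency is an assumed value, losses in the converter itself are ignored, and the 1235W figure is the nominal total computed earlier in the thread, so the real draw could be higher.

```python
def wall_current_amps(load_watts, line_volts, psu_efficiency=0.92):
    """Current drawn at the wall for a given DC-side load.

    psu_efficiency is an assumption; converter losses are not modeled.
    """
    wall_watts = load_watts / psu_efficiency
    return wall_watts / line_volts


nominal_load = 1235  # W, nominal total from the power budget in this thread
amps = wall_current_amps(nominal_load, line_volts=110)
print(f"{amps:.1f} A at 110 V")  # 12.2 A at 110 V
```

Even at nominal load this is already close to the 15A that a common outlet is configured to supply, before any short-term spikes are counted.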

Per the log, you have the same kind of issue as before:

Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:68:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: GPU 0000:68:00.0: GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:19:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: GPU 0000:19:00.0: GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:67:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: Xid (PCI:0000:68:00): 79, pid=0, GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: GPU 0000:68:00.0: GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: Xid (PCI:0000:19:00): 79, pid=0, GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: GPU 0000:19:00.0: GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=0, GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: Xid (PCI:0000:67:00): 79, pid=0, GPU has fallen off the bus.
Oct  1 11:58:08 localhost kernel: NVRM: GPU 0000:67:00.0: GPU has fallen off the bus.

Since all four GPUs seem to be involved, it does not look like a problem with a particular GPU (defective GPUs are rare, but do happen).
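
That observation can be checked mechanically. The Python sketch below groups Xid 79 messages by timestamp and lists the PCI addresses involved in each event; the regular expression assumes the syslog line layout shown in the excerpts above, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:68:00): 79, pid=4682, ...
XID_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s.*"
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:]+)\): (?P<xid>\d+),"
)

# Lines copied from the log excerpt in this thread.
SAMPLE = """\
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:68:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: GPU 0000:68:00.0: GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:19:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=4682, GPU has fallen off the bus.
Oct  1 11:13:05 localhost kernel: NVRM: Xid (PCI:0000:67:00): 79, pid=4682, GPU has fallen off the bus.
"""


def gpus_per_event(log_text, xid_code=79):
    """Map each timestamp to the sorted PCI addresses reporting xid_code."""
    events = defaultdict(set)
    for line in log_text.splitlines():
        match = XID_LINE.search(line.strip())
        if match and int(match.group("xid")) == xid_code:
            stamp = " ".join(match.group("stamp").split())  # collapse padding
            events[stamp].add(match.group("pci"))
    return {stamp: sorted(pcis) for stamp, pcis in events.items()}


for stamp, pcis in gpus_per_event(SAMPLE).items():
    print(stamp, len(pcis), pcis)
```

Four distinct addresses at the same timestamp point to a shared cause rather than one bad card.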

Have you tried running with three GPUs as I had suggested? Does the machine operate stably with only three GPUs?

If this is a system you built yourself, you may have other issues than just raw PSU power, e.g. with power cabling for each GPU. You need a dedicated strand running from the PSU to each GPU, with no converters, Y-splitters, or daisy-chaining. Get on-site help from someone who has built HPC systems before.

If this is a system you acquired pre-built and fully configured, I would suggest you contact the vendor and have them resolve the stability issues under heavy load. After all, you paid for a working system.

I am unable to offer further suggestions based on remote diagnosis over the internet and will disengage from this thread.