Multi-GPU training stuck/freezes while single GPU works well

Using 3 Tesla T4 GPUs for training. The network trains fine on a single GPU with cuda(0), cuda(1), or cuda(2), but it hangs when using
“nn.DataParallel(model)”
or when splitting the model across two GPUs with
“subNetwork1.cuda(0)” and “subNetwork2.cuda(1)”.
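
Roughly, this is what the three setups look like (a minimal sketch with a placeholder SubNetwork module and dummy tensor sizes, not my real model):

import torch
import torch.nn as nn

# Placeholder module; the real network is larger but set up the same way.
class SubNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

    def forward(self, x):
        return self.layers(x)

x = torch.randn(32, 128)

# 1) Single GPU: works on any of cuda:0, cuda:1, cuda:2.
model = SubNetwork().cuda(0)
out = model(x.cuda(0))

# 2) Data parallelism across the three GPUs: this is where it hangs.
model = nn.DataParallel(SubNetwork().cuda(0), device_ids=[0, 1, 2])
out = model(x.cuda(0))

# 3) Manual split, one sub-network per GPU: hangs as well.
subNetwork1 = SubNetwork().cuda(0)
subNetwork2 = SubNetwork().cuda(1)
out = subNetwork2(subNetwork1(x.cuda(0)).cuda(1))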

Checking nvidia-smi shows that each GPU has some memory allocated (something like n/16.0G), but GPU utilization stays at 0%.
The training just hangs there, and kill -9 cannot stop the process.
The only way to recover is to reboot the machine manually.
Can anyone help me fix this problem?

Does dmesg show any GPU-related error messages? You would primarily be looking for messages starting with NVRM (NVIDIA resource manager).

What does this mean? Could you clarify, please?

Stuck exactly how? I assume the display stops being updated (e.g., the clock no longer changes) and it does not accept any keyboard input, but you can ssh in from another machine to try kill -9 on the process?

Hi njuffa,

I tried training my network on a single GPU, using model.cuda(0), model.cuda(1), or model.cuda(2), since I have 3 Tesla T4s on my server.

It is not the display that freezes. I can still work on other things, but the Python process just hangs there. It keeps its GPU memory allocated and does not release it even after I run kill -9 in the terminal.

I copied the dmesg output around the messages starting with NVRM, and it looks like this:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.129 Sun Jul 21 07:02:47 CDT 2019 (using threaded interrupts)
[ 7.564307] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 410.129 Sun Jul 21 07:01:32 CDT 2019
[ 7.566077] [drm] [nvidia-drm] [GPU ID 0x00001b00] Loading driver
[ 7.566078] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:1b:00.0 on minor 1
[ 7.566147] [drm] [nvidia-drm] [GPU ID 0x00001d00] Loading driver
[ 7.566148] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:1d:00.0 on minor 2
[ 7.566205] [drm] [nvidia-drm] [GPU ID 0x00004000] Loading driver
[ 7.566206] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:40:00.0 on minor 3
[ 7.606716] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:3a:0a.0 (INTERRUPT)
[ 7.606823] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:3a:0c.0 (INTERRUPT)
[ 7.607080] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:ae:0a.0 (INTERRUPT)
[ 7.607525] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:ae:0c.0 (INTERRUPT)
[ 7.619597] intel_rapl: Found RAPL domain package
[ 7.619601] intel_rapl: Found RAPL domain dram
[ 7.619603] intel_rapl: DRAM domain energy unit 15300pj
[ 7.619604] intel_rapl: RAPL package 0 domain package locked by BIOS
[ 7.620272] intel_rapl: Found RAPL domain package
[ 7.620276] intel_rapl: Found RAPL domain dram
[ 7.620278] intel_rapl: DRAM domain energy unit 15300pj
[ 7.620279] intel_rapl: RAPL package 1 domain package locked by BIOS
[ 7.764190] Adding 998396k swap on /dev/sda5. Priority:-2 extents:1 across:998396k FS
[ 8.251399] IPv6: ADDRCONF(NETDEV_UP): enp94s0f1: link is not ready
[ 8.451799] ixgbe 0000:5e:00.1: registered PHC device on enp94s0f1
[ 8.556911] IPv6: ADDRCONF(NETDEV_UP): enp94s0f1: link is not ready
[ 8.559676] IPv6: ADDRCONF(NETDEV_UP): enp94s0f0: link is not ready
[ 8.764866] ixgbe 0000:5e:00.0: registered PHC device on enp94s0f0
[ 8.872872] IPv6: ADDRCONF(NETDEV_UP): enp94s0f0: link is not ready
[ 8.875737] IPv6: ADDRCONF(NETDEV_UP): enp62s0f1: link is not ready
[ 8.986630] IPv6: ADDRCONF(NETDEV_UP): enp62s0f1: link is not ready
[ 8.989461] IPv6: ADDRCONF(NETDEV_UP): enp62s0f0: link is not ready
[ 9.100552] IPv6: ADDRCONF(NETDEV_UP): enp62s0f0: link is not ready
[ 12.580475] igb 0000:3e:00.1 enp62s0f1: igb: enp62s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 12.580643] IPv6: ADDRCONF(NETDEV_CHANGE): enp62s0f1: link becomes ready
[ 288.162212] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 235
[ 312.669332] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[ 9612.666470] perf: interrupt took too long (2514 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[11741.513477] perf: interrupt took too long (3149 > 3142), lowering kernel.perf_event_max_sample_rate to 63500
[17484.108158] perf: interrupt took too long (3938 > 3936), lowering kernel.perf_event_max_sample_rate to 50750
[26616.408739] perf: interrupt took too long (4930 > 4922), lowering kernel.perf_event_max_sample_rate to 40500
[83800.510320] perf: interrupt took too long (6169 > 6162), lowering kernel.perf_event_max_sample_rate to 32250

I don’t see anything in the information provided so far that suggests that this issue is directly related to CUDA or GPUs.

I would suggest using standard debugging methods to narrow down the exact failure mechanism.
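
For example, one standard step is to find out where the Python process is actually stuck. A minimal sketch, assuming you can add a few lines near the top of the training script (the 10-minute timeout is just an example value):

import faulthandler
import signal

# Sending SIGUSR1 to this process will print the Python stack trace of
# every thread to stderr, which usually shows the last Python call that
# entered the CUDA runtime before the hang.
faulthandler.register(signal.SIGUSR1)

# Also dump all stack traces automatically if the script is silent for 10 minutes.
faulthandler.dump_traceback_later(timeout=600, repeat=True)

# ... rest of the training script ...

While the process appears hung, send the signal from another terminal with kill -USR1 <pid> and check the script's stderr. Note that if the process is blocked in an uninterruptible kernel/driver call (state D in ps), no signal will be delivered at all, including SIGKILL, which would be consistent with kill -9 having no effect.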