Multi-GPU training stuck/freezes while single GPU works well

Using 3 Tesla T4 GPUs for training. The network trains fine on a single GPU with cuda(0), cuda(1), or cuda(2), but it hangs when using
“nn.DataParallel(model)”
or when splitting the model across two GPUs with
“subNetwork1.cuda(0)” and “subNetwork2.cuda(1)”.
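
Roughly, this is what the three setups look like (a minimal sketch with a placeholder SubNetwork module and dummy tensor sizes, not my real model):

import torch
import torch.nn as nn

# Placeholder module; the real network is larger but set up the same way.
class SubNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

    def forward(self, x):
        return self.layers(x)

x = torch.randn(32, 128)

# 1) Single GPU: works on any of cuda:0, cuda:1, cuda:2.
model = SubNetwork().cuda(0)
out = model(x.cuda(0))

# 2) Data parallelism across the three GPUs: this is where it hangs.
model = nn.DataParallel(SubNetwork().cuda(0), device_ids=[0, 1, 2])
out = model(x.cuda(0))

# 3) Manual split, one sub-network per GPU: hangs as well.
subNetwork1 = SubNetwork().cuda(0)
subNetwork2 = SubNetwork().cuda(1)
out = subNetwork2(subNetwork1(x.cuda(0)).cuda(1))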

Checking nvidia-smi shows that each GPU has some memory allocated (something like n/16.0G), but GPU utilization stays at 0%.
The training just hangs there, and kill -9 cannot stop the process.
The only way to recover is to reboot the machine manually.
Can anyone help me fix this problem?

Does dmesg show any GPU-related error messages? You would primarily be looking for messages starting with NVRM (NVIDIA resource manager).

What does this mean? Could you clarify, please?

Stuck exactly how? I assume the display stops being updated (e.g., the clock no longer changes) and it does not accept any keyboard input, but you can ssh in from another machine to try kill -9 on the process?

Hi njuffa,

I tried training my network on a single GPU, using model.cuda(0), model.cuda(1), or model.cuda(2), since I have 3 Tesla T4s on my server.

It is not the display that freezes. I can still work on other things, but the Python process just hangs there. It keeps its GPU memory allocated and does not release it even after I run kill -9 in the terminal.

I copied the dmesg output around the messages starting with NVRM, and it looks like this:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.129 Sun Jul 21 07:02:47 CDT 2019 (using threaded interrupts)
[ 7.564307] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 410.129 Sun Jul 21 07:01:32 CDT 2019
[ 7.566077] [drm] [nvidia-drm] [GPU ID 0x00001b00] Loading driver
[ 7.566078] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:1b:00.0 on minor 1
[ 7.566147] [drm] [nvidia-drm] [GPU ID 0x00001d00] Loading driver
[ 7.566148] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:1d:00.0 on minor 2
[ 7.566205] [drm] [nvidia-drm] [GPU ID 0x00004000] Loading driver
[ 7.566206] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:40:00.0 on minor 3
[ 7.606716] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:3a:0a.0 (INTERRUPT)
[ 7.606823] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:3a:0c.0 (INTERRUPT)
[ 7.607080] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:ae:0a.0 (INTERRUPT)
[ 7.607525] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:ae:0c.0 (INTERRUPT)
[ 7.619597] intel_rapl: Found RAPL domain package
[ 7.619601] intel_rapl: Found RAPL domain dram
[ 7.619603] intel_rapl: DRAM domain energy unit 15300pj
[ 7.619604] intel_rapl: RAPL package 0 domain package locked by BIOS
[ 7.620272] intel_rapl: Found RAPL domain package
[ 7.620276] intel_rapl: Found RAPL domain dram
[ 7.620278] intel_rapl: DRAM domain energy unit 15300pj
[ 7.620279] intel_rapl: RAPL package 1 domain package locked by BIOS
[ 7.764190] Adding 998396k swap on /dev/sda5. Priority:-2 extents:1 across:998396k FS
[ 8.251399] IPv6: ADDRCONF(NETDEV_UP): enp94s0f1: link is not ready
[ 8.451799] ixgbe 0000:5e:00.1: registered PHC device on enp94s0f1
[ 8.556911] IPv6: ADDRCONF(NETDEV_UP): enp94s0f1: link is not ready
[ 8.559676] IPv6: ADDRCONF(NETDEV_UP): enp94s0f0: link is not ready
[ 8.764866] ixgbe 0000:5e:00.0: registered PHC device on enp94s0f0
[ 8.872872] IPv6: ADDRCONF(NETDEV_UP): enp94s0f0: link is not ready
[ 8.875737] IPv6: ADDRCONF(NETDEV_UP): enp62s0f1: link is not ready
[ 8.986630] IPv6: ADDRCONF(NETDEV_UP): enp62s0f1: link is not ready
[ 8.989461] IPv6: ADDRCONF(NETDEV_UP): enp62s0f0: link is not ready
[ 9.100552] IPv6: ADDRCONF(NETDEV_UP): enp62s0f0: link is not ready
[ 12.580475] igb 0000:3e:00.1 enp62s0f1: igb: enp62s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 12.580643] IPv6: ADDRCONF(NETDEV_CHANGE): enp62s0f1: link becomes ready
[ 288.162212] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 235
[ 312.669332] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[ 9612.666470] perf: interrupt took too long (2514 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[11741.513477] perf: interrupt took too long (3149 > 3142), lowering kernel.perf_event_max_sample_rate to 63500
[17484.108158] perf: interrupt took too long (3938 > 3936), lowering kernel.perf_event_max_sample_rate to 50750
[26616.408739] perf: interrupt took too long (4930 > 4922), lowering kernel.perf_event_max_sample_rate to 40500
[83800.510320] perf: interrupt took too long (6169 > 6162), lowering kernel.perf_event_max_sample_rate to 32250

I don’t see anything in the information provided so far that suggests that this issue is directly related to CUDA or GPUs.

I would suggest using standard debugging methods to narrow down the exact failure mechanism.
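
For example, one standard step is to find out where the Python process is actually stuck. A minimal sketch, assuming you can add a few lines near the top of the training script (the 10-minute timeout is just an example value):

import faulthandler
import signal

# Sending SIGUSR1 to this process will print the Python stack trace of
# every thread to stderr, which usually shows the last Python call that
# entered the CUDA runtime before the hang.
faulthandler.register(signal.SIGUSR1)

# Also dump all stack traces automatically if the script is silent for 10 minutes.
faulthandler.dump_traceback_later(timeout=600, repeat=True)

# ... rest of the training script ...

While the process appears hung, send the signal from another terminal with kill -USR1 <pid> and check the script's stderr. Note that if the process is blocked in an uninterruptible kernel/driver call (state D in ps), no signal will be delivered at all, including SIGKILL, which would be consistent with kill -9 having no effect.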