MPS client hangs

I run the CUDA nbody sample on a V100 GPU with the command ./nbody -benchmark -i=1 -numbodies=256, but it sometimes hangs.
This does not happen when MPS is switched off.

Information that may be useful:

1. nvidia-smi output:
Mon Feb 10 17:49:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |     53MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   37C    P0    52W / 300W |  14181MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |     39MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   35C    P0    37W / 300W |     39MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24127      C   nvidia-cuda-mps-server                        29MiB |
|    0     78754    M+C   ./nbody                                       14MiB |
|    1      3038    M+C   python                                     14169MiB |
|    2     24127      C   nvidia-cuda-mps-server                        29MiB |
|    3     24127      C   nvidia-cuda-mps-server                        29MiB |
+-----------------------------------------------------------------------------+

2. Steps for switching on MPS:
export CUDA_VISIBLE_DEVICES=0,2,3
nvidia-cuda-mps-control -d
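
For reference, a fuller start/stop sequence might look like the following sketch (the pipe and log directories shown are just the defaults written out explicitly, so those extra exports are optional):

# stop any previously running MPS control daemon (as the user who started it)
echo quit | nvidia-cuda-mps-control

# limit MPS to the desired GPUs and (optionally) pin the pipe/log directories
export CUDA_VISIBLE_DEVICES=0,2,3
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps        # default location
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps     # default location

# start the control daemon in background mode
nvidia-cuda-mps-control -d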

3. Run ./nbody in a Docker container:
Start and exec into a container: docker run -it --rm --env NVIDIA_VISIBLE_DEVICES=0 --ipc=host ae36ff35deec bash
Execute the nbody command: ./nbody -benchmark -i=1 -numbodies=256

4. MPS logs:
/var/log/nvidia-mps/control.log:
[2020-02-10 17:49:04.629 Control 11125] Accepting connection...
[2020-02-10 17:49:04.629 Control 11125] User did not send valid credentials
[2020-02-10 17:49:04.629 Control 11125] Accepting connection...
[2020-02-10 17:49:04.629 Control 11125] NEW CLIENT 78675 from user 0: Server already exists
[2020-02-10 17:49:08.013 Control 11125] Accepting connection...
[2020-02-10 17:49:08.013 Control 11125] User did not send valid credentials
[2020-02-10 17:49:08.013 Control 11125] Accepting connection...
[2020-02-10 17:49:08.013 Control 11125] NEW CLIENT 78754 from user 0: Server already exists
[2020-02-10 17:49:08.034 Control 11125] Accepting connection...
[2020-02-10 17:49:08.034 Control 11125] NEW CLIENT 78754 from user 0: Server already exists

/var/log/nvidia-mps/server.log:
[2020-02-10 17:49:04.629 Other 24127] MPS Server: worker created
[2020-02-10 17:49:04.629 Other 24127] Volta MPS: Creating worker thread
[2020-02-10 17:49:04.650 Other 24127] Receive command failed, assuming client exit
[2020-02-10 17:49:04.650 Other 24127] Volta MPS: Client disconnected
[2020-02-10 17:49:08.013 Other 24127] Volta MPS Server: Received new client request
[2020-02-10 17:49:08.013 Other 24127] MPS Server: worker created
[2020-02-10 17:49:08.013 Other 24127] Volta MPS: Creating worker thread
[2020-02-10 17:49:08.034 Other 24127] Volta MPS Server: Received new client request
[2020-02-10 17:49:08.034 Other 24127] MPS Server: worker created
[2020-02-10 17:49:08.034 Other 24127] Volta MPS: Creating worker thread
[2020-02-10 17:49:08.034 Other 24127] Volta MPS: Device Tesla V100-SXM2-16GB (uuid 0x12b0f689-0x2ce53dc5-0x5d8591f6-0x9a1acdb3) is associated

5. Operating system:
CentOS Linux release 7.3.1611 (Core)
kernel version: 4.17.11-1.el7.elrepo.x86_64

Operations to trigger the hang (a scripted version is sketched after this list):

  1. the first execution is OK
    ./nbody -benchmark -i=1 -numbodies=256
  2. change the command arguments to run longer
    ./nbody -benchmark -i=100 -numbodies=819200
    after a few seconds, press Ctrl+C to cancel it
  3. execute step 1 again; it then hangs
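
For reference, a rough scripted version of these steps (the nbody path and the 10-second wait are assumptions; timeout -s INT stands in for pressing Ctrl+C):

#!/bin/bash
# step 1: short run, completes normally
./nbody -benchmark -i=1 -numbodies=256

# step 2: longer run, interrupted after ~10 seconds with SIGINT (like Ctrl+C)
timeout -s INT 10 ./nbody -benchmark -i=100 -numbodies=819200

# step 3: repeat the short run; with MPS on, this is where the hang appears.
# The outer timeout turns a hang into a non-zero exit status instead of blocking forever.
timeout 60 ./nbody -benchmark -i=1 -numbodies=256 || echo "step 3 hung or failed (exit $?)"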

Any advice on how to debug the hang problem? Thanks very much!
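
For example, would something like the commands below be a sensible way to inspect the hung client and the MPS server? (Just my guess at useful commands; the server PID 24127 is taken from the nvidia-smi output above.)

# list MPS servers and the clients attached to them
echo get_server_list | nvidia-cuda-mps-control
echo get_client_list 24127 | nvidia-cuda-mps-control

# capture user-space stacks of the hung nbody process
gdb -batch -p "$(pidof nbody)" -ex "thread apply all bt"

# check the process state reported by the kernel
grep State /proc/$(pidof nbody)/status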

Hi @zsh912006,
did you ever figure out what was happening, or a way to prevent the problem?

.Andrea

I’m not sure if you will see this post...

When you run the Docker container, try bind-mounting /tmp/nvidia-mps/ into it with the -v option:
docker run -it --rm --env NVIDIA_VISIBLE_DEVICES=0 -v /tmp/nvidia-mps:/tmp/nvidia-mps --ipc=host ae36ff35deec bash
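
If it helps, a quick way to check that the mount worked from inside the container might be (the path is the default MPS pipe directory):

# the host's MPS control pipe should show up here
ls -l /tmp/nvidia-mps
# then rerun the benchmark
./nbody -benchmark -i=1 -numbodies=256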

This and this and this may be of interest.