MPS client hangs

I run the CUDA nbody sample on a V100 GPU with the command ./nbody -benchmark -i=1 -numbodies=256, but it sometimes hangs.
This does not happen when MPS is switched off.

Information that may be useful:

1. nvidia-smi output:
Mon Feb 10 17:49:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |     53MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   37C    P0    52W / 300W |  14181MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |     39MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   35C    P0    37W / 300W |     39MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24127      C   nvidia-cuda-mps-server                        29MiB |
|    0     78754    M+C   ./nbody                                       14MiB |
|    1      3038    M+C   python                                     14169MiB |
|    2     24127      C   nvidia-cuda-mps-server                        29MiB |
|    3     24127      C   nvidia-cuda-mps-server                        29MiB |
+-----------------------------------------------------------------------------+

2. Steps for switching on MPS:
export CUDA_VISIBLE_DEVICES=0,2,3
nvidia-cuda-mps-control -d
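
For reference, a fuller start/stop sequence might look like the following sketch (the pipe and log directories shown are just the defaults written out explicitly, so those extra exports are optional):

# stop any previously running MPS control daemon (as the user who started it)
echo quit | nvidia-cuda-mps-control

# limit MPS to the desired GPUs and (optionally) pin the pipe/log directories
export CUDA_VISIBLE_DEVICES=0,2,3
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps        # default location
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps     # default location

# start the control daemon in background mode
nvidia-cuda-mps-control -d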

3. Run ./nbody in a Docker container:
Start and exec into a container: docker run -it --rm --env NVIDIA_VISIBLE_DEVICES=0 --ipc=host ae36ff35deec bash
Execute the nbody command: ./nbody -benchmark -i=1 -numbodies=256

4. MPS logs:
/var/log/nvidia-mps/control.log:
[2020-02-10 17:49:04.629 Control 11125] Accepting connection...
[2020-02-10 17:49:04.629 Control 11125] User did not send valid credentials
[2020-02-10 17:49:04.629 Control 11125] Accepting connection...
[2020-02-10 17:49:04.629 Control 11125] NEW CLIENT 78675 from user 0: Server already exists
[2020-02-10 17:49:08.013 Control 11125] Accepting connection...
[2020-02-10 17:49:08.013 Control 11125] User did not send valid credentials
[2020-02-10 17:49:08.013 Control 11125] Accepting connection...
[2020-02-10 17:49:08.013 Control 11125] NEW CLIENT 78754 from user 0: Server already exists
[2020-02-10 17:49:08.034 Control 11125] Accepting connection...
[2020-02-10 17:49:08.034 Control 11125] NEW CLIENT 78754 from user 0: Server already exists

/var/log/nvidia-mps/server.log:
[2020-02-10 17:49:04.629 Other 24127] MPS Server: worker created
[2020-02-10 17:49:04.629 Other 24127] Volta MPS: Creating worker thread
[2020-02-10 17:49:04.650 Other 24127] Receive command failed, assuming client exit
[2020-02-10 17:49:04.650 Other 24127] Volta MPS: Client disconnected
[2020-02-10 17:49:08.013 Other 24127] Volta MPS Server: Received new client request
[2020-02-10 17:49:08.013 Other 24127] MPS Server: worker created
[2020-02-10 17:49:08.013 Other 24127] Volta MPS: Creating worker thread
[2020-02-10 17:49:08.034 Other 24127] Volta MPS Server: Received new client request
[2020-02-10 17:49:08.034 Other 24127] MPS Server: worker created
[2020-02-10 17:49:08.034 Other 24127] Volta MPS: Creating worker thread
[2020-02-10 17:49:08.034 Other 24127] Volta MPS: Device Tesla V100-SXM2-16GB (uuid 0x12b0f689-0x2ce53dc5-0x5d8591f6-0x9a1acdb3) is associated

5. Operating system:
CentOS Linux release 7.3.1611 (Core)
kernel version: 4.17.11-1.el7.elrepo.x86_64

Operations to trigger the hang (a scripted version is sketched after this list):

  1. the first execution is OK
    ./nbody -benchmark -i=1 -numbodies=256
  2. change the command arguments to run longer
    ./nbody -benchmark -i=100 -numbodies=819200
    after a few seconds, press Ctrl+C to cancel it
  3. execute step 1 again; it then hangs
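
For reference, a rough scripted version of these steps (the nbody path and the 10-second wait are assumptions; timeout -s INT stands in for pressing Ctrl+C):

#!/bin/bash
# step 1: short run, completes normally
./nbody -benchmark -i=1 -numbodies=256

# step 2: longer run, interrupted after ~10 seconds with SIGINT (like Ctrl+C)
timeout -s INT 10 ./nbody -benchmark -i=100 -numbodies=819200

# step 3: repeat the short run; with MPS on, this is where the hang appears.
# The outer timeout turns a hang into a non-zero exit status instead of blocking forever.
timeout 60 ./nbody -benchmark -i=1 -numbodies=256 || echo "step 3 hung or failed (exit $?)"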

Any advice on how to debug the hang problem? Thanks very much!
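
For example, would something like the commands below be a sensible way to inspect the hung client and the MPS server? (Just my guess at useful commands; the server PID 24127 is taken from the nvidia-smi output above.)

# list MPS servers and the clients attached to them
echo get_server_list | nvidia-cuda-mps-control
echo get_client_list 24127 | nvidia-cuda-mps-control

# capture user-space stacks of the hung nbody process
gdb -batch -p "$(pidof nbody)" -ex "thread apply all bt"

# check the process state reported by the kernel
grep State /proc/$(pidof nbody)/status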

Hi @zsh912006,
did you ever figure out what was happening, or a way to prevent the problem?

.Andrea

I’m not sure if you will see this post...

When you run the Docker container, try bind-mounting /tmp/nvidia-mps/ into it with the -v option:
docker run -it --rm --env NVIDIA_VISIBLE_DEVICES=0 -v /tmp/nvidia-mps:/tmp/nvidia-mps --ipc=host ae36ff35deec bash
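
If it helps, a quick way to check that the mount worked from inside the container might be (the path is the default MPS pipe directory):

# the host's MPS control pipe should show up here
ls -l /tmp/nvidia-mps
# then rerun the benchmark
./nbody -benchmark -i=1 -numbodies=256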

This and this and this may be of interest.