Unable to get MPS server working

I’m installing NVIDIA drivers and CUDA on an existing Red Hat 7 machine.

GPU: Tesla V100-SXM2-16GB
Driver version: 535.154.05
CUDA version: 12.2

The drivers are installed, CUDA is installed, and everything appears to be working until I try to run something that uses the MPS server.

The MPS control daemon is running, but it fails with errors whenever it tries to start server processes. I’ve included the loaded kernel modules, server.log, control.log, and nvidia-smi output below.

All the logs tell me is that driver initialization failed, and I’m unsure where to go from here. Does anyone have ideas or pointers that could help get this working?
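For reference, one sanity check I ran was confirming that the control daemon’s socket actually exists at the pipe directory the logs mention. A minimal sketch (the path /tmp/nvidia-mps is taken from my server.log; CUDA_MPS_PIPE_DIRECTORY can override it on other setups):

```python
import os

def mps_socket_present(pipe_dir="/tmp/nvidia-mps"):
    """Check whether the MPS control daemon's UNIX socket exists.

    pipe_dir defaults to /tmp/nvidia-mps, the path shown in server.log;
    the daemon honors CUDA_MPS_PIPE_DIRECTORY if that was set instead.
    """
    return os.path.exists(os.path.join(pipe_dir, "control"))

print(mps_socket_present())
```

In my case the socket is present (the servers do connect to it before dying), so the problem is downstream of the daemon itself.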

deviceQuery:

# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 805
-> MPS client failed to connect to the MPS control daemon or the MPS server
Result = FAIL
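For anyone decoding similar failures: error 805 from cudaGetDeviceCount corresponds to cudaErrorMpsConnectionFailed in the runtime’s error enum, which matches the message printed. The nearby MPS-related codes, as I read them from the CUDA 12.x headers (values assumed from driver_types.h; double-check against your toolkit):

```python
# MPS-related cudaError_t values (assumed from CUDA 12.x driver_types.h;
# verify against your own toolkit's headers before relying on them).
MPS_ERRORS = {
    805: "cudaErrorMpsConnectionFailed",     # client couldn't reach daemon/server
    806: "cudaErrorMpsRpcFailure",
    807: "cudaErrorMpsServerNotReady",
    808: "cudaErrorMpsMaxClientsReached",
    809: "cudaErrorMpsMaxConnectionsReached",
}

print(MPS_ERRORS.get(805))
```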

kernel modules:

# lsmod | grep nvidia
nvidia_uvm           1527792  4 
nvidia_drm             72892  0 
nvidia_modeset       1629048  1 nvidia_drm
nvidia              62227316  46 gdrdrv,nvidia_modeset,nvidia_uvm
drm_kms_helper        238197  2 ast,nvidia_drm
drm                   543438  6 nvidia,ast,ttm,nvidia_drm,drm_kms_helper

server.log:

[2024-02-21 17:59:54.332 Other 10339] Startup
[2024-02-21 17:59:54.332 Other 10339] Connecting to control daemon on socket: /tmp/nvidia-mps/control
[2024-02-21 17:59:54.332 Other 10339] Initializing server process
[2024-02-21 17:59:54.356 Other 10339] Driver initialization failed with: initialization error
[2024-02-21 17:59:54.356 Other 10339] Failed to start : initialization error
[2024-02-21 17:59:54.360 Other 10347] Startup
[2024-02-21 17:59:54.360 Other 10347] Connecting to control daemon on socket: /tmp/nvidia-mps/control
[2024-02-21 17:59:54.360 Other 10347] Initializing server process
[2024-02-21 17:59:54.386 Other 10347] Driver initialization failed with: initialization error
[2024-02-21 17:59:54.386 Other 10347] Failed to start : initialization error
[2024-02-21 17:59:54.389 Other 10351] Startup
[2024-02-21 17:59:54.389 Other 10351] Connecting to control daemon on socket: /tmp/nvidia-mps/control
[2024-02-21 17:59:54.389 Other 10351] Initializing server process
[2024-02-21 17:59:54.414 Other 10351] Driver initialization failed with: initialization error
[2024-02-21 17:59:54.414 Other 10351] Failed to start : initialization error
[2024-02-21 17:59:54.418 Other 10355] Startup
[2024-02-21 17:59:54.418 Other 10355] Connecting to control daemon on socket: /tmp/nvidia-mps/control
[2024-02-21 17:59:54.418 Other 10355] Initializing server process
[2024-02-21 17:59:54.444 Other 10355] Driver initialization failed with: initialization error
[2024-02-21 17:59:54.444 Other 10355] Failed to start : initialization error
[2024-02-21 17:59:54.448 Other 10359] Startup
[2024-02-21 17:59:54.448 Other 10359] Connecting to control daemon on socket: /tmp/nvidia-mps/control
[2024-02-21 17:59:54.448 Other 10359] Initializing server process
[2024-02-21 17:59:54.474 Other 10359] Driver initialization failed with: initialization error
[2024-02-21 17:59:54.474 Other 10359] Failed to start : initialization error
[2024-02-21 17:59:54.478 Other 10363] Startup
[2024-02-21 17:59:54.478 Other 10363] Connecting to control daemon on socket: /tmp/nvidia-mps/control
[2024-02-21 17:59:54.478 Other 10363] Initializing server process
[2024-02-21 17:59:54.501 Other 10363] Driver initialization failed with: initialization error
[2024-02-21 17:59:54.501 Other 10363] Failed to start : initialization error

control.log:

[2024-02-21 17:59:54.320 Control  4745] Accepting connection...
[2024-02-21 17:59:54.320 Control  4745] User did not send valid credentials
[2024-02-21 17:59:54.320 Control  4745] Accepting connection...
[2024-02-21 17:59:54.320 Control  4745] NEW CLIENT 10338 from user 0: Server is not ready, push client to pending list
[2024-02-21 17:59:54.320 Control  4745] Starting new server 10339 for user 0
[2024-02-21 17:59:54.332 Control  4745] Accepting connection...
[2024-02-21 17:59:54.356 Control  4745] Server encountered a fatal exception. Shutting down
[2024-02-21 17:59:54.356 Control  4745] Server 10339 exited with status 1
[2024-02-21 17:59:54.356 Control  4745] Starting new server 10347 for user 0
[2024-02-21 17:59:54.360 Control  4745] Accepting connection...
[2024-02-21 17:59:54.386 Control  4745] Server encountered a fatal exception. Shutting down
[2024-02-21 17:59:54.386 Control  4745] Server 10347 exited with status 1
[2024-02-21 17:59:54.386 Control  4745] Starting new server 10351 for user 0
[2024-02-21 17:59:54.389 Control  4745] Accepting connection...
[2024-02-21 17:59:54.414 Control  4745] Server encountered a fatal exception. Shutting down
[2024-02-21 17:59:54.414 Control  4745] Server 10351 exited with status 1
[2024-02-21 17:59:54.414 Control  4745] Starting new server 10355 for user 0
[2024-02-21 17:59:54.418 Control  4745] Accepting connection...
[2024-02-21 17:59:54.444 Control  4745] Server encountered a fatal exception. Shutting down
[2024-02-21 17:59:54.444 Control  4745] Server 10355 exited with status 1
[2024-02-21 17:59:54.444 Control  4745] Starting new server 10359 for user 0
[2024-02-21 17:59:54.448 Control  4745] Accepting connection...
[2024-02-21 17:59:54.474 Control  4745] Server encountered a fatal exception. Shutting down
[2024-02-21 17:59:54.474 Control  4745] Server 10359 exited with status 1
[2024-02-21 17:59:54.474 Control  4745] Starting new server 10363 for user 0
[2024-02-21 17:59:54.478 Control  4745] Accepting connection...
[2024-02-21 17:59:54.501 Control  4745] Server encountered a fatal exception. Shutting down
[2024-02-21 17:59:54.501 Control  4745] Server 10363 exited with status 1
[2024-02-21 17:59:54.501 Control  4745] Removed Shm file at 
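The control.log pattern (start a server, fatal exception, exit status 1, start another) repeats six times in under a second. A throwaway sketch I used to summarize logs like this, in case it helps others compare runs (not an NVIDIA tool, just regexes over the log format above):

```python
import re

def summarize_control_log(text):
    """Count server launches and non-zero exits in an MPS control.log excerpt."""
    started = re.findall(r"Starting new server (\d+)", text)
    exited = re.findall(r"Server (\d+) exited with status (\d+)", text)
    return {
        "started": len(started),
        "exited_nonzero": sum(1 for _, status in exited if status != "0"),
    }

sample = """[2024-02-21 17:59:54.320 Control  4745] Starting new server 10339 for user 0
[2024-02-21 17:59:54.356 Control  4745] Server 10339 exited with status 1"""
print(summarize_control_log(sample))  # {'started': 1, 'exited_nonzero': 1}
```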

nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000004:04:00.0 Off |                    0 |
| N/A   32C    P0              36W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           Off | 00000004:05:00.0 Off |                    0 |
| N/A   35C    P0              38W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           Off | 00000035:03:00.0 Off |                    0 |
| N/A   31C    P0              37W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           Off | 00000035:04:00.0 Off |                    0 |
| N/A   35C    P0              36W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

First, stop the MPS server.

Then verify your CUDA install by running a sample code like vectorAdd or deviceQuery.

Does it run normally?
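The “stop the MPS server” step above amounts to sending quit to the control daemon, e.g. `echo quit | nvidia-cuda-mps-control` in a shell. A sketch of scripting that check (assumes nvidia-cuda-mps-control is on PATH; the binary name is the standard one shipped with the driver):

```python
import subprocess

def stop_mps(control_cmd="nvidia-cuda-mps-control"):
    """Ask the MPS control daemon to shut down by sending 'quit' on stdin.

    control_cmd is assumed to be on PATH; returns the command's exit code.
    """
    result = subprocess.run([control_cmd], input="quit\n",
                            text=True, capture_output=True)
    return result.returncode
```

With MPS stopped, deviceQuery talks to the driver directly, which isolates whether the failure is in the CUDA install itself or only in the MPS path.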

I was originally using driver version 535.154.05 and never did figure out what was going on; even uninstalling and reinstalling didn’t help. Driver 535.161.07 just became available, I tried it today, and everything works perfectly.
