MPS client failed to reserve virtual memory range at address (nil)

Hi.

I am trying to run a multi-threaded application (same process, multiple threads using the same CUDA resources) with CUDA 10.1.
Nothing special is set via “nvidia-smi”.
To see possible errors, I started the control daemon in the foreground with “nvidia-cuda-mps-control -f”.
The reason for using the MPS server is to avoid resource conflicts that cause “flickering” in images when simultaneous threads use CUDA.
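For context, the start sequence looks roughly like this (a sketch only — the pipe/log directory paths are illustrative examples, not values from my actual setup; CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY are the standard MPS environment variables):

```shell
# Illustrative MPS start sequence; directory paths are examples only.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps      # clients find the control pipe here
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log   # control.log / server.log land here
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# -f keeps the control daemon in the foreground so errors print to the
# terminal (-d would daemonize it). Skip gracefully if MPS is not installed.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -f
else
    echo "nvidia-cuda-mps-control not found"
fi
```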

But I have no luck starting the MPS server: the control daemon reports the “virtual memory” error shown below.
Any hints about this?

-niilo

[2019-06-10 16:54:21.601 Control 22509] Accepting connection…
[2019-06-10 16:54:21.601 Control 22509] User did not send valid credentials
[2019-06-10 16:54:21.601 Control 22509] Accepting connection…
[2019-06-10 16:54:21.601 Control 22509] NEW CLIENT 22635 from user 0: Server is not ready, push client to pending list
[2019-06-10 16:54:21.601 Control 22509] Starting new server 22721 for user 0
[2019-06-10 16:54:21.623 Other 22721] Start
[2019-06-10 16:54:21.825 Control 22509] Accepting connection…
[2019-06-10 16:54:21.825 Control 22509] NEW SERVER 22721: Ready
[2019-06-10 16:54:21.825 Other 22721] MPS Server: Received new client request
[2019-06-10 16:54:21.825 Other 22721] MPS Server: worker created
[2019-06-10 16:54:21.958 Control 22509] Accepting connection…
[2019-06-10 16:54:21.958 Control 22509] User did not send valid credentials
[2019-06-10 16:54:21.958 Control 22509] Accepting connection…
[2019-06-10 16:54:21.958 Control 22509] NEW CLIENT 22635 from user 0: Server already exists
[2019-06-10 16:54:21.958 Other 22721] MPS Server: Received new client request
[2019-06-10 16:54:21.958 Other 22721] MPS Server: worker created
[2019-06-10 16:54:21.959 Client 22635] MPS client failed to reserve virtual memory range at address (nil)


System:

uname -r
4.15.0-51-lowlatency

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial

model name : Intel® Xeon® E-2176G CPU @ 3.70GHz

I don’t know what that means; I’ve never heard of that reason for using MPS. To a first-order approximation, MPS has no bearing on a single multi-threaded application or its behavior. It is primarily focused on the behavior of independent processes, from independent applications, launched by the same user.

Having said all that, the MPS documentation discusses various considerations for virtual memory usage. You may wish to search the document for the word “virtual” and see whether any of the advised restrictions and/or workarounds apply:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

(for example, Section 4.4, Known Issues)

I am running an 8× K80 setup on AWS and am trying to get MPS running. I have had success using the same environment on a single-K80 system, but on the 8-K80 system the above error message appears in server.log.

The error occurs even if I make only one of the 8 GPUs visible using CUDA_VISIBLE_DEVICES. This setup should be very similar to the single-GPU system that I did get up and running.
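In case it matters, this is roughly the sequence I mean (device index 0 is just an example; as I understand it, the control daemon and the servers it spawns inherit CUDA_VISIBLE_DEVICES from the shell that starts them):

```shell
# Restrict MPS to a single GPU; index 0 is an arbitrary example.
export CUDA_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Restart the daemon so new server processes inherit the setting;
# skipped gracefully on machines without MPS installed.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    echo quit | nvidia-cuda-mps-control   # stop any running daemon
    nvidia-cuda-mps-control -d            # start it again, daemonized
fi
```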

The MPS document does not help in this case. Is there any more information on why this error message appears and how to fix it?