Here are the steps to reproduce the issue I am facing. I have the following toy CUDA program (the compiled executable is toy).
/**************toy.cu*********************************/
#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

#define BLOCK_SIZE 256

__global__ void do_something(float *d_array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[idx] *= 100;
}

int main()
{
    long N = 1 << 10;
    float *arr = (float *) malloc(N * sizeof(float));
    long i;
    for (i = 1; i <= N; i++)
        arr[i - 1] = i;

    float *d_array;
    cudaError_t ret;  /* cudaMalloc/cudaMemcpy return cudaError_t, not int */
    ret = cudaMalloc(&d_array, N * sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    ret = cudaMemcpy(d_array, arr, N * sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int num_blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);

    ret = cudaMemcpy(arr, d_array, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int j;
    for (i = 0; i < N;)
    {
        for (j = 0; j < 8; j++)
            printf("%.0f\t", arr[i++]);
        printf("\n");
    }
    cudaFree(d_array);
    free(arr);
    return 0;
}
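Note that the program above only prints the return codes of cudaMalloc/cudaMemcpy; the kernel launch itself is never checked, which is why a failed run can complete silently with unmodified output. A minimal sketch of the same program with launch checking added (the CHECK macro and the file name toy_checked.cu are my own, not from the original code):

```cuda
// toy_checked.cu (hypothetical) — compile with: nvcc -o toy_checked toy_checked.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 256

// Hypothetical helper: fail loudly on any CUDA error instead of continuing.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

__global__ void do_something(float *d_array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[idx] *= 100;
}

int main(void)
{
    long N = 1 << 10;
    float *arr = (float *) malloc(N * sizeof(float));
    for (long i = 0; i < N; i++)
        arr[i] = (float)(i + 1);

    float *d_array;
    CHECK(cudaMalloc(&d_array, N * sizeof(float)));
    CHECK(cudaMemcpy(d_array, arr, N * sizeof(float), cudaMemcpyHostToDevice));

    int num_blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);
    CHECK(cudaGetLastError());       // launch-time errors (e.g. a refused client)
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel executed

    CHECK(cudaMemcpy(arr, d_array, N * sizeof(float), cudaMemcpyDeviceToHost));
    printf("arr[0] = %.0f (expect 100)\n", arr[0]);

    CHECK(cudaFree(d_array));
    free(arr);
    return 0;
}
```

With this version, a run where the kernel never executes should abort with an error message rather than printing the untouched input array.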
Using the following script, I can launch many instances of this program simultaneously without any issue when MPS is not running.
#!/bin/bash
# Check that the number of loop iterations is provided
if [ "$#" -lt 1 ]; then
    echo "Usage: $0 <num_iterations>"
    exit 1
fi

# Number of loop iterations from the first command-line argument
num_iterations="$1"

# Launch that many instances of toy in the background
for (( i = 1; i <= num_iterations; i++ )); do
    ./toy &
done
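Since toy_launch.sh backgrounds every instance and exits without waiting, a hung client is invisible from the script itself. A small variant that waits on each instance and counts non-zero exits makes a stall under MPS immediately visible. This is a sketch: the CMD variable and the script name are my additions, defaulting to a no-op so the pattern runs standalone; set CMD=./toy to drive the actual binary.

```shell
#!/bin/bash
# toy_launch_wait.sh (hypothetical name): launch N instances, then wait
# for each one and count failures. A hung instance keeps `wait` blocked,
# which is exactly the symptom to look for under MPS.
cmd="${CMD:-true}"            # CMD=./toy reproduces the original run
num_iterations="${1:-8}"

pids=()
for (( i = 1; i <= num_iterations; i++ )); do
    "$cmd" > /dev/null &
    pids+=("$!")
done

failed=0
for pid in "${pids[@]}"; do
    wait "$pid" || failed=$((failed + 1))
done
echo "launched: $num_iterations, failed: $failed"
```

If MPS hangs partway through, the script simply stops at `wait` for the stuck instance instead of returning to the prompt.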
$ ./toy_launch.sh 40 >> /dev/null
The above script works fine without MPS.
I enable MPS with the following command:
sudo CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 nvidia-cuda-mps-control -d
$ ./toy_launch.sh 40 >> /dev/null
The script above still works fine, until I set the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable.
I set the environment variable as follows:
$ export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=2
Now, running the same script:
$ ./toy_launch.sh 40 >> /dev/null
causes MPS to hang after processing only about 18 requests.
The machine is then unable to execute any more GPU programs. nvidia-smi still shows the nvidia-cuda-mps-server process running, but trying to quit the daemon with:
$ sudo nvidia-cuda-mps-control
quit
has no effect; the prompt simply hangs there. Manually killing the server with the kill command and the server's PID stops MPS, and I can launch GPU programs again.
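After a forced kill, stale MPS state can be left behind in the pipe directory, which might explain why a restarted daemon misbehaves. Below is a hedged teardown sketch; the script name is my own, the paths are the documented defaults for CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY, and by default it only prints the commands (set DRY_RUN=0 to actually run them).

```shell
#!/bin/bash
# mps_teardown.sh (hypothetical): fully tear down MPS after a forced kill.
# Defaults to a dry run that only prints the commands; set DRY_RUN=0 to act.
pipe_dir="${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}"
log_dir="${CUDA_MPS_LOG_DIRECTORY:-/var/log/nvidia-mps}"

run() {
    if [ "${DRY_RUN:-1}" != "0" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run sudo pkill -f nvidia-cuda-mps-control   # stop the control daemon
run sudo pkill -f nvidia-cuda-mps-server    # stop any lingering server
run sudo rm -rf "$pipe_dir"                 # remove stale pipes/sockets
echo "pipe dir: $pipe_dir, log dir: $log_dir"
```

After a clean teardown like this, the daemon can be restarted with the same nvidia-cuda-mps-control -d command as before.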
But the problem arises when I try restarting MPS:
sudo CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 nvidia-cuda-mps-control -d
After this restart, launching the CUDA program appears to succeed, but the kernel never actually executes on the GPU. The output is:
...
1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33...
instead of,
...
100 200 300 400 500 600 700 800
900 1000 1100 1200 1300 1400 1500 1600
1700 1800 1900 2000 2100 2200 2300 2400
2500 2600 2700 2800 2900 3000 3100 3200
3300...
And nvidia-smi does not report the nvidia-cuda-mps-server after the program finishes. [Note that while the program runs, nvidia-smi shows the nvidia-cuda-mps-server only for a very short moment before it disappears. It seems the server tries to start but cannot.]