MULTI-PROCESS SERVICE (MPS) has no effect

Recently I was testing the MPS feature of CUDA. I intended to compare how my code runs with MPS on and off. I'm using a GTX 1060 3GB with CUDA 9.0 on Ubuntu 16.04.
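For reference, this is roughly how I toggle MPS on and off, following the MPS documentation (the EXCLUSIVE_PROCESS step is optional, and the device index may differ on your machine):

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # optional: restrict the GPU to MPS clients (run as root)
nvidia-cuda-mps-control -d              # start the MPS control daemon
# ... run the test programs ...
echo quit | nvidia-cuda-mps-control     # shut the daemon down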

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Each thread adds one element, then busy-waits to stretch the kernel's
// runtime. Note: an optimizing (release) build may remove the loop; it
// survives here because the program was built as Debug.
__global__ void vectorAdd1(const float *A, const float *B, float *C,
		int numElements) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	if (i < numElements) {
		C[i] = A[i] + B[i];
		int tmp = 0;
		while (tmp <= 1000000) { tmp++; }
	}
}
int main(void) {
	clock_t start, finish;
	double duration;
	//=================start==================//
	start = clock();
	// Error code to check return values for CUDA calls
	cudaError_t err = cudaSuccess;

	// Print the vector length to be used, and compute its size
	int numElements = 1024 * 90;
	size_t size = numElements * sizeof(float);

	float *h_A = (float *) malloc(size);
	float *h_B = (float *) malloc(size);
	float *h_C = (float *) malloc(size);

	// Verify that allocations succeeded
	if (h_A == NULL || h_B == NULL || h_C == NULL) {
		fprintf(stderr, "Failed to allocate host vectors!\n");
		exit(EXIT_FAILURE);
	}

	for (long i = 0; i < numElements; ++i) {
		h_A[i] = rand() / (float) RAND_MAX;
		h_B[i] = rand() / (float) RAND_MAX;
	}

	float *d_A = NULL;
	err = cudaMalloc((void **) &d_A, size);
	float *d_B = NULL;
	err = cudaMalloc((void **) &d_B, size);

	float *d_C = NULL;
	err = cudaMalloc((void **) &d_C, size);

	err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
	err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

	cudaStream_t stream1;
	cudaStreamCreate(&stream1);

	// Launch the Vector Add CUDA Kernel
	int threadsPerBlock = 1024;
	int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
	printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
			threadsPerBlock);
  
	vectorAdd1<<<blocksPerGrid, threadsPerBlock,0,stream1>>>(d_A, d_B, d_C, numElements);
  
	err = cudaGetLastError();
	cudaDeviceSynchronize();

	//=================finish==================//
	finish = clock();

	err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

	// Free device global memory
	err = cudaFree(d_A);
	err = cudaFree(d_B);
	err = cudaFree(d_C);

	// Free host memory
	free(h_A);
	free(h_B);
	free(h_C);

	duration = (double) (finish - start) / CLOCKS_PER_SEC;
	printf("time : %lf\n", duration);
	return 0;
}
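A note on the measurement: clock() returns CPU time rather than wall-clock time; it happens to track wall time here because cudaDeviceSynchronize() typically spin-waits on the CPU. To time just the kernel, a minimal sketch using CUDA events, dropped in around the launch in the program above, would be:

	cudaEvent_t evStart, evStop;
	cudaEventCreate(&evStart);
	cudaEventCreate(&evStop);

	cudaEventRecord(evStart, stream1);
	vectorAdd1<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_A, d_B, d_C, numElements);
	cudaEventRecord(evStop, stream1);

	cudaEventSynchronize(evStop);                 // wait until the kernel has finished
	float ms = 0.0f;
	cudaEventElapsedTime(&ms, evStart, evStop);   // elapsed time in milliseconds
	printf("kernel time : %f ms\n", ms);

	cudaEventDestroy(evStart);
	cudaEventDestroy(evStop);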

Unlike the usual benchmark, I measured the time from the start of the program to the end of the kernel. I found that even with MPS on, the kernels from the two processes aren't processed concurrently on the GPU but are executed one after the other:

fish@fish:~/cuda-workspace/ABC/Debug$ ./ABC 
CUDA kernel launch with 90 blocks of 1024 threads
time : 1.367465

fish@fish:~/cuda-workspace/ABC/Debug$ ./ABC & ./ABC 
CUDA kernel launch with 90 blocks of 1024 threads
CUDA kernel launch with 90 blocks of 1024 threads
time : 1.348312
time : 2.529013

The timeline captured with nvvp also matches these results.
[nvvp timeline screenshot]

The result after turning MPS off is as expected: on the Pascal architecture, the GPU schedules multiple applications in a time-sliced manner, so both runs finish at roughly the same (doubled) time:

fish@fish:~/cuda-workspace/ABC/Debug$ ./ABC & ./ABC 
CUDA kernel launch with 90 blocks of 1024 threads
CUDA kernel launch with 90 blocks of 1024 threads

time : 2.808099
time : 2.811294

This is just a small program I wrote, so it can't show the advantages of MPS by itself. I also tried running some of the Rodinia benchmarks at the same time, and the same effect appeared as above. In fact, I couldn't find any two programs that achieve the parallelism described for MPS unless I limit the combined number of CTAs of the two apps to fewer than 9 (the number of SMs on my 1060 3GB). That makes no sense for real programs, because real programs launch far more CTAs than that.
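Concretely, the only kind of launch I could get to overlap under MPS looks like the following sketch (a hypothetical variant of the code above: a grid-stride loop lets a fixed 4-block grid cover the whole array, leaving SMs free for a second MPS client):

	// Hypothetical small-grid variant: each thread strides over the array,
	// so 4 blocks cover all elements and two such processes fit on 9 SMs.
	__global__ void vectorAddStrided(const float *A, const float *B, float *C,
			int numElements) {
		int stride = gridDim.x * blockDim.x;
		for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < numElements;
				i += stride) {
			C[i] = A[i] + B[i];
		}
	}

	// launched with a deliberately small grid instead of 90 blocks:
	vectorAddStrided<<<4, threadsPerBlock, 0, stream1>>>(d_A, d_B, d_C, numElements);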

So can you explain why this happens, or provide some useful examples of what kind of app can achieve true parallelism under MPS?

gpu - How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications? - Stack Overflow

I have read this answer, but it doesn't solve my problem.
My 1060 has 9 SMs. Suppose there are two processes, and one of them launches a kernel with more than 9 blocks; then the other process cannot run in parallel with it, even if MPS is running. Is that right?

Assuming the MPS service is running, do the logs tell you anything?
Check /var/log/nvidia-mps to see whether anything is updated when you launch the program(s).
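For example (assuming a default installation, where the control daemon writes control.log and server.log there):

ls -l /var/log/nvidia-mps/
tail /var/log/nvidia-mps/control.log    # daemon start/stop and client connections
tail /var/log/nvidia-mps/server.log     # per-server messages
# if supported by your driver, you can also query the daemon directly:
echo get_server_list | nvidia-cuda-mps-control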