Recently I was testing the MPS feature of CUDA. I intended to compare the runtime behavior of my code with MPS on and off. I'm using a GTX 1060 3GB with CUDA 9.0 on Ubuntu 16.04.
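For reference, MPS was switched on and off with the standard control daemon (the usual commands; device 0 and root assumed, and the EXCLUSIVE_PROCESS compute mode is the recommended setting for MPS):

fish@fish:~$ sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
fish@fish:~$ nvidia-cuda-mps-control -d          # start the daemon ("MPS on")
fish@fish:~$ echo quit | nvidia-cuda-mps-control # stop the daemon ("MPS off")

Here is the test program: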
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

__global__ void vectorAdd1(const float *A, const float *B, float *C,
                           int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
        // Busy-wait to stretch the kernel's runtime; volatile keeps the
        // compiler from eliminating the otherwise-dead loop.
        volatile int tmp = 0;
        while (tmp <= 1000000) { tmp++; }
    }
}
int main(void) {
    clock_t start, finish;
    double duration;

    //=================start==================//
    start = clock();

    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 1024 * 90;
    size_t size = numElements * sizeof(float);

    float *h_A = (float *) malloc(size);
    float *h_B = (float *) malloc(size);
    float *h_C = (float *) malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL) {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors with random values in [0, 1]
    for (long i = 0; i < numElements; ++i) {
        h_A[i] = rand() / (float) RAND_MAX;
        h_B[i] = rand() / (float) RAND_MAX;
    }

    // Allocate the device vectors
    float *d_A = NULL;
    err = cudaMalloc((void **) &d_A, size);
    float *d_B = NULL;
    err = cudaMalloc((void **) &d_B, size);
    float *d_C = NULL;
    err = cudaMalloc((void **) &d_C, size);

    // Copy the host input vectors to the device
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 1024;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
           threadsPerBlock);
    vectorAdd1<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_A, d_B, d_C,
                                                               numElements);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch vectorAdd1: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    cudaDeviceSynchronize();

    //=================finish==================//
    finish = clock();

    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device global memory
    err = cudaFree(d_A);
    err = cudaFree(d_B);
    err = cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    duration = (double) (finish - start) / CLOCKS_PER_SEC;
    printf("time : %lf\n", duration);
    return 0;
}
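As a side note, clock() measures host CPU time rather than the kernel's own duration; CUDA events give a device-side measurement to cross-check against. A minimal sketch around the launch above (the event variable names are my own):

cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);
cudaEventRecord(evStart, stream1);            // timestamp just before the kernel
vectorAdd1<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_A, d_B, d_C, numElements);
cudaEventRecord(evStop, stream1);             // timestamp right after the kernel
cudaEventSynchronize(evStop);                 // block until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, evStart, evStop);   // device-measured elapsed milliseconds
printf("kernel time : %f ms\n", ms);
cudaEventDestroy(evStart);
cudaEventDestroy(evStop);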
Unlike the usual experiment, I measured the time from the start of the program to the end of the kernel, and found that the two processes' kernels aren't processed concurrently on the GPU but are executed in sequential order:
fish@fish:~/cuda-workspace/ABC/Debug$ ./ABC
CUDA kernel launch with 90 blocks of 1024 threads
time : 1.367465
fish@fish:~/cuda-workspace/ABC/Debug$ ./ABC & ./ABC
CUDA kernel launch with 90 blocks of 1024 threads
CUDA kernel launch with 90 blocks of 1024 threads
time : 1.348312
time : 2.529013
The timeline captured with nvvp also matches the experimental results:
[nvvp timeline screenshot]
The result after shutting down MPS is in line with expectations, because on the Pascal architecture the GPU schedules multiple applications in a time-sliced manner:
fish@fish:~/cuda-workspace/ABC/Debug$ ./ABC & ./ABC
CUDA kernel launch with 90 blocks of 1024 threads
CUDA kernel launch with 90 blocks of 1024 threads
time : 2.808099
time : 2.811294
This is just a small program I wrote, so it can't show off the advantages of MPS. I also tried running some apps from rodinia at the same time, and the same effect appeared as above. In fact, I couldn't find any two programs that achieve the parallelism described by MPS, unless I limit the sum of the number of CTAs of the two apps to no more than 9 (the number of SMs on my 1060 3GB). That makes no sense for real programs, because real programs have many CTAs.
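For reference, the only kind of configuration that did overlap for me kept each process's grid tiny. A sketch of that idea (vectorAdd2 and its grid-stride loop are my additions, not part of the test program above):

__global__ void vectorAdd2(const float *A, const float *B, float *C,
                           int numElements) {
    // Grid-stride loop: a small fixed grid still covers every element,
    // leaving SMs free so a second process's kernel can be co-resident.
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < numElements;
         i += blockDim.x * gridDim.x) {
        C[i] = A[i] + B[i];
    }
}

// 4 CTAs per process: two such processes together stay within the 9 SMs.
vectorAdd2<<<4, 1024, 0, stream1>>>(d_A, d_B, d_C, numElements);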
So can you explain why this happens, or provide some useful examples of what kind of app can achieve true parallelism under MPS?