I wrote a simple CUDA program to verify whether parallelizing kernel launches across host threads can boost GPU utilization (built with --default-stream per-thread enabled):
#pragma omp parallel for
for (int i = 0; i < array_size; i++)
{
    while (1)   // launch endlessly so utilization can be observed
    {
        dgt_mul<<<gDim, bDim, 0, st>>>(......);   // st: this thread's own stream
    }
}
When the program spawns 2 threads, GPU utilization can be more than ~50%; but when it spawns 3 threads, utilization drops to ~30%.
I limited the iteration count and profiled it:
cudaProfilerStart();
#pragma omp parallel for
for (int i = 0; i < array_size; i++)
{
    for (int j = 0; j < 1000; j++)
    {
        dgt_mul<<<gDim, bDim, 0, st>>>(......);
    }
}
cudaProfilerStop();
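For anyone who wants to reproduce this, here is a minimal self-contained sketch of the harness, with a trivial stand-in kernel (the real dgt_mul, sizes, and launch configuration differ; here the launches go to each thread's per-thread default stream instead of an explicit st, which behaves the same as one explicit stream per thread):

// Build: nvcc -O2 -Xcompiler -fopenmp --default-stream per-thread repro.cu
#include <cuda_profiler_api.h>

__global__ void dgt_mul(float *out, int n)   // stand-in busy kernel
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = out[idx] * 2.0f + 1.0f;   // arbitrary work to occupy the SMs
}

int main()
{
    const int n = 1 << 20;
    const int num_threads = 3;               // set to 2 or 3 to compare
    float *buf[num_threads];
    for (int i = 0; i < num_threads; i++)
        cudaMalloc(&buf[i], n * sizeof(float));

    cudaProfilerStart();
    #pragma omp parallel for num_threads(num_threads)
    for (int i = 0; i < num_threads; i++)
    {
        // With --default-stream per-thread, each host thread launches into
        // its own default stream, so the kernels are free to overlap.
        for (int j = 0; j < 1000; j++)
            dgt_mul<<<(n + 255) / 256, 256>>>(buf[i], n);
    }
    cudaDeviceSynchronize();
    cudaProfilerStop();

    for (int i = 0; i < num_threads; i++)
        cudaFree(buf[i]);
    return 0;
}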
The following profile is for 2 threads:
https://i.stack.imgur.com/vvcqz.jpg
And this is for 3 threads:
https://i.stack.imgur.com/piYVS.jpg
In the 2-thread case the parallelization looks fine, but in the 3-thread case the launches actually become nearly serial. I am not sure whether this is because cudaLaunchKernel becomes the bottleneck.
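One way to test that hypothesis is to time only the host side of the launch loop: launches are asynchronous, so with no synchronization inside the loop this mostly measures time spent in the launch calls themselves. A rough sketch, again with a stand-in kernel:

// Build: nvcc -O2 -Xcompiler -fopenmp --default-stream per-thread launch_cost.cu
#include <chrono>
#include <cstdio>
#include <omp.h>

__global__ void dgt_mul(float *out, int n)   // stand-in kernel
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] *= 2.0f;
}

int main()
{
    const int n = 1 << 20, iters = 1000;
    const int num_threads = 3;               // compare the output for 2 vs. 3
    float *buf;                              // shared on purpose; races are
    cudaMalloc(&buf, n * sizeof(float));     // harmless for a timing test

    #pragma omp parallel for num_threads(num_threads)
    for (int i = 0; i < num_threads; i++)
    {
        auto t0 = std::chrono::steady_clock::now();
        for (int j = 0; j < iters; j++)
            dgt_mul<<<(n + 255) / 256, 256>>>(buf, n);
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("thread %d: %.2f us per launch (host side)\n",
               omp_get_thread_num(), us / iters);
    }
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}

If the per-launch time jumps noticeably when going from 2 to 3 threads, that would point at contention on the launch path rather than at the device itself.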
Could anyone give me some clues about this phenomenon? Thanks very much in advance!
P.S.: I originally posted this question at https://stackoverflow.com/questions/55177474/why-does-the-parallelization-make-gpu-utilization-become-lower, but it received no answers, so I am reposting it here. Thanks!