Why does the parallelization make GPU utilization become lower?

xiao_nan · March 17, 2019, 7:26am

I wrote a simple CUDA program to check whether parallelization can boost GPU utilization (compiled with --default-stream per-thread):

#pragma omp parallel for
for (int i = 0; i < array_size; i++)
{
    while (1)
    {
        dgt_mul<<<gDim, bDim, 0, st>>>(......);
    }
}

When the program spawns 2 threads, GPU utilization can exceed ~50%; with 3 threads, it drops to ~30%.

I then limited the iteration count and profiled the program:

cudaProfilerStart();
#pragma omp parallel for
for (int i = 0; i < array_size; i++)
{
    for (int j = 0; j < 1000; j++)
    {
        dgt_mul<<<gDim, bDim, 0, st>>>(......);
    }
}
cudaProfilerStop();

The following result is for 2 threads:
https://i.stack.imgur.com/vvcqz.jpg

And this is for 3 threads:
https://i.stack.imgur.com/piYVS.jpg

In the 2-thread case the kernels appear to run in parallel, while in the 3-thread case execution becomes nearly serial. I am not sure whether this is because cudaLaunchKernel becomes the bottleneck.

Could anyone give some clues about this behavior? Thanks very much in advance!

P.S. This question was also posted at https://stackoverflow.com/questions/55177474/why-does-the-parallelization-make-gpu-utilization-become-lower, but no one has answered it, so I reposted it here. Thanks!

xiao_nan · March 19, 2019, 8:30am

The cause is most likely a bottleneck in the kernel launch queue (please see my post: Parallelization may cause GPU utilization become worse | Nan Xiao's Blog).
