As I said earlier in this topic, I have a Monte Carlo implementation.
I measure a CPU time for the serial function and a GPU time for the parallel kernel.
Both run in the same program: first the serial code, then the parallel code.
Then I calculate the execution efficiency based on the number of threads per block.
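
To give an idea of the structure, here is a stripped-down sketch (not my exact code; the LCG in place of curand, the pi-style sample, and names like `mc_kernel` are just for illustration):

```cuda
// Sketch of the setup: serial CPU pass, then the kernel, each timed,
// with efficiency = speedup / threads per block.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

static const long long N = 33554432LL;   // total iterations (512 * 65536)

// Cheap per-thread LCG so the sketch needs no curand; maps to [0, 1).
__host__ __device__ inline float lcg01(unsigned int &s) {
    s = 1664525u * s + 1013904223u;
    return s * (1.0f / 4294967296.0f);
}

// One Monte Carlo sample (unit-circle hit test) per thread.
__global__ void mc_kernel(unsigned int *hits) {
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int seed = gid * 2654435761u + 1u;
    float x = lcg01(seed), y = lcg01(seed);
    if (x * x + y * y <= 1.0f) atomicAdd(hits, 1u);
}

int main(int argc, char **argv) {
    int tpb = (argc > 1) ? atoi(argv[1]) : 32;  // 32 / 64 / 128 / 256
    int blocks = (int)(N / tpb);                // N is divisible by all four

    // Serial CPU pass, timed with std::chrono.
    auto t0 = std::chrono::high_resolution_clock::now();
    unsigned int cpu_hits = 0;
    for (long long i = 0; i < N; ++i) {
        unsigned int seed = (unsigned int)i * 2654435761u + 1u;
        float x = lcg01(seed), y = lcg01(seed);
        if (x * x + y * y <= 1.0f) ++cpu_hits;
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    float cpu_ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
    printf("cpu pi estimate = %f\n", 4.0 * cpu_hits / (double)N);

    // Parallel GPU pass, timed with CUDA events.
    unsigned int *d_hits;
    cudaMalloc(&d_hits, sizeof(unsigned int));
    cudaMemset(d_hits, 0, sizeof(unsigned int));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    mc_kernel<<<blocks, tpb>>>(d_hits);  // this grid size needs compute capability >= 3.0
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("tpb=%d  cpu=%.2f ms  gpu=%.2f ms  efficiency=%.2f%%\n",
           tpb, cpu_ms, gpu_ms, 100.0f * (cpu_ms / gpu_ms) / tpb);
    cudaFree(d_hits);
    return 0;
}
```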
I'm getting these results: for 33554432 iterations (512 x 65536 passed as arguments) and 32 threads per block, I get 1394.07 ms on the GPU and 6646.00 ms on the CPU, for an efficiency of 14.90%.
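
To be explicit about the metric: the efficiency is the speedup over the serial run divided by the threads per block, which reproduces every row of the table below:

$$\text{efficiency} = \frac{T_{\text{cpu}} / T_{\text{gpu}}}{\text{threads per block}} = \frac{6646.00 / 1394.07}{32} \approx \frac{4.77}{32} \approx 14.9\%$$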
Each time I double the threads per block, the efficiency roughly halves, even though the GPU time barely changes.
Here is what I have:

| threads per block | efficiency (%) | GPU time (ms) | CPU time (ms) |
| --- | --- | --- | --- |
| 32 | 14.90 | 1394.07 | 6646.00 |
| 64 | 7.42 | 1399.73 | 6646.00 |
| 128 | 3.64 | 1427.69 | 6646.00 |
| 256 | 1.86 | 1395.64 | 6646.00 |
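
For reference, the launch configuration across the rows (assuming the one-sample-per-thread launch from the sketch above, with the total work fixed):

```cuda
// Total work N is fixed, so doubling threads per block halves the grid.
const int tpbs[4] = {32, 64, 128, 256};
for (int i = 0; i < 4; ++i) {
    int blocks = (int)(N / tpbs[i]);   // 1048576, 524288, 262144, 131072
    mc_kernel<<<blocks, tpbs[i]>>>(d_hits);
}
```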
I'm trying to explain why I get the best efficiency with 32 threads per block and the worst with 256, even though the 256-thread GPU time is better than the 64- and 128-thread times.
I ran nvprof with the achieved_occupancy, branch_efficiency, and warp_execution_efficiency metrics.
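
The invocation was along these lines (the binary name and the argument are placeholders):

```
nvprof --metrics achieved_occupancy,branch_efficiency,warp_execution_efficiency ./montecarlo 32
```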
The achieved_occupancy increases as the threads per block go from 32 to 256.
The branch_efficiency and warp_execution_efficiency remain constant.
How can I explain this change in efficiency as a function of threads per block?