As I said earlier in this topic, I have a Monte Carlo implementation.
I measure a CPU time for the serial function and a GPU time for the parallel kernel.
Both run in the same program: first the serial code, then the parallel code.
Then I calculate the execution efficiency based on the number of threads per block.
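
To give an idea of the structure, here is a stripped-down sketch (not my exact code; the LCG in place of curand, the pi-style sample, and names like `mc_kernel` are just for illustration):

```cuda
// Sketch of the setup: serial CPU pass, then the kernel, each timed,
// with efficiency = speedup / threads per block.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

static const long long N = 33554432LL;   // total iterations (512 * 65536)

// Cheap per-thread LCG so the sketch needs no curand; maps to [0, 1).
__host__ __device__ inline float lcg01(unsigned int &s) {
    s = 1664525u * s + 1013904223u;
    return s * (1.0f / 4294967296.0f);
}

// One Monte Carlo sample (unit-circle hit test) per thread.
__global__ void mc_kernel(unsigned int *hits) {
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int seed = gid * 2654435761u + 1u;
    float x = lcg01(seed), y = lcg01(seed);
    if (x * x + y * y <= 1.0f) atomicAdd(hits, 1u);
}

int main(int argc, char **argv) {
    int tpb = (argc > 1) ? atoi(argv[1]) : 32;  // 32 / 64 / 128 / 256
    int blocks = (int)(N / tpb);                // N is divisible by all four

    // Serial CPU pass, timed with std::chrono.
    auto t0 = std::chrono::high_resolution_clock::now();
    unsigned int cpu_hits = 0;
    for (long long i = 0; i < N; ++i) {
        unsigned int seed = (unsigned int)i * 2654435761u + 1u;
        float x = lcg01(seed), y = lcg01(seed);
        if (x * x + y * y <= 1.0f) ++cpu_hits;
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    float cpu_ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
    printf("cpu pi estimate = %f\n", 4.0 * cpu_hits / (double)N);

    // Parallel GPU pass, timed with CUDA events.
    unsigned int *d_hits;
    cudaMalloc(&d_hits, sizeof(unsigned int));
    cudaMemset(d_hits, 0, sizeof(unsigned int));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    mc_kernel<<<blocks, tpb>>>(d_hits);  // this grid size needs compute capability >= 3.0
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("tpb=%d  cpu=%.2f ms  gpu=%.2f ms  efficiency=%.2f%%\n",
           tpb, cpu_ms, gpu_ms, 100.0f * (cpu_ms / gpu_ms) / tpb);
    cudaFree(d_hits);
    return 0;
}
```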
I'm getting these results: for 33554432 iterations (512 x 65536 passed as arguments) and 32 threads per block, I get 1394.07 ms on the GPU and 6646.00 ms on the CPU, for an efficiency of 14.90%.
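
To be explicit about the metric: the efficiency is the speedup over the serial run divided by the threads per block, which reproduces every row of the table below:

$$\text{efficiency} = \frac{T_{\text{cpu}} / T_{\text{gpu}}}{\text{threads per block}} = \frac{6646.00 / 1394.07}{32} \approx \frac{4.77}{32} \approx 14.9\%$$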
Each time I double the threads per block, the efficiency roughly halves, even though the GPU time barely changes.
Here is what I have:

| threads per block | efficiency (%) | GPU time (ms) | CPU time (ms) |
| --- | --- | --- | --- |
| 32 | 14.90 | 1394.07 | 6646.00 |
| 64 | 7.42 | 1399.73 | 6646.00 |
| 128 | 3.64 | 1427.69 | 6646.00 |
| 256 | 1.86 | 1395.64 | 6646.00 |
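
For reference, the launch configuration across the rows (assuming the one-sample-per-thread launch from the sketch above, with the total work fixed):

```cuda
// Total work N is fixed, so doubling threads per block halves the grid.
const int tpbs[4] = {32, 64, 128, 256};
for (int i = 0; i < 4; ++i) {
    int blocks = (int)(N / tpbs[i]);   // 1048576, 524288, 262144, 131072
    mc_kernel<<<blocks, tpbs[i]>>>(d_hits);
}
```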
I'm trying to explain why I get the best efficiency with 32 threads per block and the worst with 256, even though the 256-thread GPU time is better than the 64- and 128-thread times.
I ran nvprof with the achieved_occupancy, branch_efficiency, and warp_execution_efficiency metrics.
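
The invocation was along these lines (the binary name and the argument are placeholders):

```
nvprof --metrics achieved_occupancy,branch_efficiency,warp_execution_efficiency ./montecarlo 32
```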
The achieved_occupancy increases as the threads per block go from 32 to 256.
The branch_efficiency and warp_execution_efficiency remain constant.
How can I explain this change in efficiency as a function of threads per block?