As I said in this topic, I have a Monte Carlo implementation.
I measure CPU and GPU times for the serial execution of the function and the parallel execution of the kernel.
The program runs both: first the serial code, then the parallel code.
Then I calculate the execution efficiency based on the thread count.
I'm getting these results:
for 33554432 iterations (launched as 512 × 65536 in the arguments) and 32 threads per block,
I get 1394.07 ms on the GPU and 6646.00 ms on the CPU,
and the efficiency is 14.90%.
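For what it's worth, the 14.90% figure is consistent with the usual definition of parallel efficiency, speedup divided by the degree of parallelism, if the degree of parallelism is taken to be the threads per block. A minimal sketch under that assumption:

```python
# Efficiency as speedup divided by the parallelism unit. Here the unit is
# ASSUMED to be threads per block, which matches the numbers in the question.
def efficiency(cpu_ms, gpu_ms, threads_per_block):
    speedup = cpu_ms / gpu_ms           # serial time / parallel time
    return speedup / threads_per_block  # fraction of ideal linear speedup

# Reproduces the reported figure: 6646.00 / 1394.07 / 32 ≈ 0.1490
print(f"{efficiency(6646.00, 1394.07, 32):.2%}")  # → 14.90%
```

Note that with this definition, doubling threads_per_block at a roughly constant GPU time automatically halves the efficiency, which matches the trend described below.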
If I change the threads-per-block number, the efficiency drops by half.
I have these results (threads_per_block, efficiency, gpu_time, cpu_time):
I'm trying to explain why I get the best efficiency with 32 threads and why 256 threads gives the worst, even though its GPU time is better than with 64 and 128 threads.
I ran nvprof for the metrics achieved_occupancy, branch_efficiency and warp_execution_efficiency.
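For completeness, the metrics were collected with a command along these lines (the executable name ./mc and its arguments are placeholders for my actual program):

```shell
# Collect the three metrics mentioned above for each kernel launch.
# ./mc 512 65536 is a placeholder invocation of my Monte Carlo binary.
nvprof --metrics achieved_occupancy,branch_efficiency,warp_execution_efficiency ./mc 512 65536
```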
The achieved_occupancy increases as the thread count goes from 32 to 256.
The branch_efficiency and warp_execution_efficiency remain constant.
How can I explain this change in efficiency based on threads per block?