Explaining the results

As I said in this topic, I have a Monte Carlo implementation.
I measure CPU and GPU times for the serial execution of the function and the parallel execution of the kernel.
Both run in the same program: first the serial code, then the parallel code.
I then calculate the efficiency of the execution based on the thread count.
I'm getting these results:
For 33554432 iterations (with 512 X 65536 in args) and 32 threads per block,
I get 1394.07 ms on the GPU and 6646.00 ms on the CPU,
and the efficiency is 14.90%.
Each time I double the threads-per-block count, the efficiency drops to half.
My measurements are:

threads_per_block   efficiency (%)   gpu_time (ms)   cpu_time (ms)
 32                 14.90            1394.07         6646.00
 64                  7.42            1399.73         6646.00
128                  3.64            1427.69         6646.00
256                  1.86            1395.64         6646.00

I'm trying to explain why I get the best efficiency with 32 threads per block and the worst with 256, even though the GPU time at 256 is lower than at 64 and 128 threads.
I ran nvprof for the achieved_occupancy, branch_efficiency, and warp_execution_efficiency metrics.
The achieved_occupancy increases as the thread count goes from 32 to 256.
The branch_efficiency and warp_execution_efficiency remain constant.

How can I explain this change in efficiency as a function of threads per block?