Speed difference depending on how block sizes are set up: blocks (3,3,3) vs. (1,1,3)

I accidentally ran two different settings: grid(43,43) with block(3,3,33) and grid(129,129) with block(1,1,33).
There was a speed difference (the second setting was 40% faster) and I don't understand why.

That's the only change I made when I ran the examples.

I assume that having 3 × 3 × 33 = 297 threads per block working in parallel still takes more time than having just 33 threads per block running simultaneously?
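
To make the comparison concrete, here is a small sketch of the arithmetic behind both launches. It only counts blocks, threads and warps and does not model actual GPU timing. It assumes a warp size of 32 (the value on NVIDIA GPUs so far), and the function name `launch_stats` is made up for illustration.

```python
import math

WARP_SIZE = 32  # assumption: threads per warp, as on NVIDIA GPUs to date


def launch_stats(grid, block):
    """Count blocks, threads and warps for one launch configuration.

    grid and block are tuples of dimensions, e.g. (43, 43) and (3, 3, 33).
    """
    blocks = math.prod(grid)
    threads_per_block = math.prod(block)
    # A block is split into warps; the last warp may be only partly filled.
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    idle_lanes = warps_per_block * WARP_SIZE - threads_per_block
    return {
        "blocks": blocks,
        "threads_per_block": threads_per_block,
        "total_threads": blocks * threads_per_block,
        "warps_per_block": warps_per_block,
        "idle_lanes_per_block": idle_lanes,
    }


first = launch_stats(grid=(43, 43), block=(3, 3, 33))
second = launch_stats(grid=(129, 129), block=(1, 1, 33))

for name, stats in (("first", first), ("second", second)):
    print(name, stats)
```

Both settings launch the same total number of threads (1849 × 297 = 16641 × 33 = 549153), so the amount of work is the same. What changes is how the threads are grouped into blocks and warps, and which threadIdx component varies fastest inside a warp. That may be where the difference comes from, for example through the memory access pattern.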

Any comments? Thanks in advance.