Experience indicates that, all other parameters being equal, finer granularity (fewer threads per block) is often advantageous for GPU performance. A good initial target for block size is 128 to 256 threads, possibly smaller on the latest GPU architectures. While there are a number of possible architectural explanations for this (e.g. unbalanced execution within large thread blocks, ramp-up and ramp-down effects at the start and end of kernels), differences can also be due to hardware implementation artifacts (“butterfly effects”), especially for memory-intensive codes.
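To keep the block size freely tunable, it helps to write kernels whose correctness does not depend on the launch configuration. A minimal sketch (the kernel name scale and its parameters are made up for illustration) using the common grid-stride loop idiom:

__global__ void scale(float *out, const float *in, float a, int n)
{
    // grid-stride loop: correct for any grid/block configuration,
    // which makes the block size a free tuning parameter
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        out[i] = a * in[i];
    }
}

// host side: start in the suggested 128-256 range, e.g.
// int threads = 128;
// int blocks  = (n + threads - 1) / threads;
// scale<<<blocks, threads>>>(d_out, d_in, 2.0f, n);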
The interactions of multiple levels of scheduling, the specific sequences of loads and stores in each warp, and buffering and re-ordering in the memory controllers are very complicated and cannot be modeled satisfactorily with publicly available information. In my work I have found that often there is no readily discernible cause-effect pattern in multi-dimensional shmoo plots of performance data, which in turn indicates that brute-force auto-tuning over multiple configuration parameters would be useful.
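As a concrete illustration of such brute-force tuning, a host-side shmoo over candidate block sizes might look like the following sketch (the kernel scale, the candidate range, and the repetition count are assumptions; in practice one would sweep additional parameters such as grid size or work items per thread):

// brute-force shmoo over block sizes, timed with CUDA events
float best_ms = 1e30f;
int best_threads = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (int threads = 32; threads <= 1024; threads *= 2) {
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_out, d_in, 2.0f, n);  // warm-up
    cudaEventRecord(start);
    for (int rep = 0; rep < 100; rep++) {
        scale<<<blocks, threads>>>(d_out, d_in, 2.0f, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_ms) { best_ms = ms; best_threads = threads; }
}
cudaEventDestroy(start);
cudaEventDestroy(stop);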
A look at the profiler statistics should highlight which metrics in particular are affected by the change in block configuration for your code; I would suggest documenting those as the immediate causes of the observed performance differences.
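For example, with Nsight Compute one can capture full metric sets for two block configurations and compare the reports (a sketch; the application name and its block-size argument are placeholders I made up):

ncu --set full -o block128 ./myapp 128
ncu --set full -o block256 ./myapp 256

Diffing the resulting reports (e.g. achieved occupancy, memory throughput, warp stall reasons) should point at the immediate causes mentioned above.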