Optimum number of threads per block

I am working on speeding up some encryption algorithms, I observed that for plaintext sizes below 1MB, the combination of 32 byte size blocks encrypted by 32 threads per block is the optimal combination to give least possible kernel launch time and for plaintext sizes greater than 1MB, 64 threads launched per block, each thread dealing with 128 byte block forms the optimal combination. Why does this happen?