The limit is 512 threads per block. Usually 128-256 is a good number, depending on the resources the kernel uses (registers and shared memory).
You want as many threads as possible scheduled on each multiprocessor (around 192 is a good minimum) so the hardware can hide latency when accessing registers.
There is a limit on the number of active blocks per multiprocessor, so you want blocks large enough to reach that minimum thread count. On the other hand, blocks need resources: if a block uses a lot of registers or shared memory, there won’t be enough resources to schedule it.
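To make the trade-off above concrete, here is a rough sketch of how the number of active blocks per multiprocessor could be estimated. The struct SmLimits and the function activeBlocksPerSm are hypothetical names, not part of the CUDA API, and the estimate ignores the hardware's allocation granularity, so treat the result as an upper bound rather than an exact figure.

```cpp
#include <algorithm>

// Per-multiprocessor limits. These are assumptions the caller supplies;
// look up the real values for your card's compute capability.
struct SmLimits {
    int maxThreads;   // max resident threads per multiprocessor
    int maxBlocks;    // max resident blocks per multiprocessor
    int registers;    // 32-bit registers per multiprocessor
    int sharedBytes;  // shared memory per multiprocessor, in bytes
};

// Rough upper bound on how many blocks can be active at once on one
// multiprocessor. Whichever resource runs out first sets the limit.
int activeBlocksPerSm(const SmLimits& sm, int threadsPerBlock,
                      int regsPerThread, int sharedPerBlock) {
    if (threadsPerBlock <= 0) return 0;
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs = regsPerThread > 0
                     ? sm.registers / (regsPerThread * threadsPerBlock)
                     : sm.maxBlocks;
    int byShared = sharedPerBlock > 0
                       ? sm.sharedBytes / sharedPerBlock
                       : sm.maxBlocks;
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}
```

For example, assuming limits of 768 threads, 8 blocks, 8192 registers and 16 KB of shared memory per multiprocessor, a 256-thread block at 10 registers per thread allows 3 active blocks (768 threads), while the same block at 20 registers per thread allows only 1 (256 threads). NVIDIA's occupancy calculator does this bookkeeping more precisely.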
You also want the block size to be a multiple of 32 (the warp size) or you’ll be wasting threads, and, if possible, to have groups of 16 threads accessing consecutive memory for coalescing.
On the other hand, you want enough blocks to schedule across all the multiprocessors to fully utilize the card, so you don’t want the blocks too large if your problem is small.
I find that for 2D problems, blocks of 16x16, 16x12, or at worst 16x8 do a good job.
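As an illustration of those 2D shapes, the sketch below checks that a block shape is a whole number of warps and computes how many blocks are needed to cover a width x height problem. Dim2, ceilDiv, gridFor and isWarpMultiple are hypothetical helper names (in real CUDA code you would use dim3), and the warp size of 32 is taken from the discussion above.

```cpp
// Simple stand-in for CUDA's dim3, two dimensions only.
struct Dim2 {
    int x;
    int y;
};

const int kWarpSize = 32;  // warp size assumed from the text above

// Integer division rounded up, for positive n and d.
int ceilDiv(int n, int d) { return (n + d - 1) / d; }

// True when the block holds a whole number of warps, so no thread slots
// in the last warp are wasted.
bool isWarpMultiple(Dim2 block) {
    return (block.x * block.y) % kWarpSize == 0;
}

// Number of blocks in each dimension needed to cover the whole problem.
// Edge blocks may hang over the boundary, so the kernel still needs a
// bounds check on its x and y indices.
Dim2 gridFor(int width, int height, Dim2 block) {
    return Dim2{ceilDiv(width, block.x), ceilDiv(height, block.y)};
}
```

All three shapes pass the warp check (256, 192 and 128 threads), and keeping x at 16 means each row of a block is one group of 16 threads with consecutive x indices. For a 1000x600 problem with 16x16 blocks, gridFor gives a 63x38 grid, which you can compare against the number of multiprocessors on the card to see whether there are enough blocks to keep them all busy.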