The best partitioning depends on the resource usage and data access patterns of the kernel.
In the absence of additional information, a reasonable heuristic is to start with a thread count per block that is a multiple of 32 and between 128 and 256 inclusive. Run the CUDA profiler on this initial version of the application, and base further code changes on what the profiler reports as the bottlenecks in the kernel.
In your case, you might want to try a 16x16 thread block (256 threads) for the initial attempt.
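As a rough sketch of what that starting point could look like: the kernel `scaleKernel`, the helper `divUp`, and the wrapper `launchScale` below are hypothetical names made up for illustration, and the code assumes a row-major `width` x `height` array of floats already resident on the device. It only shows how a 16x16 block and a rounded-up grid fit together; it is not tuned for any particular kernel.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: scales every element of a
// width x height row-major array in place.
__global__ void scaleKernel(float *data, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // The grid is rounded up, so edge blocks may overhang the array.
    if (x < width && y < height)
        data[y * width + x] *= factor;
}

// Ceiling division: number of blocks of size `block` needed to cover `n`.
static inline unsigned int divUp(int n, int block)
{
    return (unsigned int)((n + block - 1) / block);
}

// Launch with a 16x16 block: 256 threads, a multiple of 32.
// Assumption: d_data is a device pointer to width * height floats.
cudaError_t launchScale(float *d_data, int width, int height, float factor)
{
    const dim3 block(16, 16);
    const dim3 grid(divUp(width, block.x), divUp(height, block.y));
    scaleKernel<<<grid, block>>>(d_data, width, height, factor);
    return cudaGetLastError();  // reports launch configuration errors
}
```

Keeping the block shape in one place like this makes it easy to try other shapes (for example 32x8 or 32x4) once the profiler shows where the time actually goes.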