Is there a performance penalty if a 256-thread block is launched as a 2D 16 x 16 block, a 4 x 64 block, or a 1D 256 x 1 block?
Should the choice of thread block shape be dictated by factors such as the texture fetches, constant memory accesses, etc. performed in the kernel? Or is the choice simply application-centric and dictated by the algorithm?
In addition, if the grid has thousands of thread blocks, how can one determine how many blocks will run concurrently on a single multiprocessor (MP)? I mainly want to figure out how to divide the shared memory among concurrent blocks. Is there a way to limit the concurrency to 4 or 8 blocks per MP at a time, so that each block gets a decent amount of shared memory?
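To make the shared memory part of the question concrete, here is the kind of back-of-the-envelope estimate I have in mind. It is only a sketch in plain host-side C++, not a CUDA API: the helper names are hypothetical, and it assumes that shared memory and a fixed hardware cap on resident blocks are the only limits, ignoring registers and per-MP thread limits, which may well matter in practice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of any CUDA API): estimates an upper bound
// on the number of blocks resident on one MP when shared memory is the only
// limiting resource besides a hardware cap on resident blocks.
// Register usage and per-MP thread limits are deliberately ignored.
std::size_t blocks_per_mp_by_shared_mem(std::size_t smem_per_mp,
                                        std::size_t smem_per_block,
                                        std::size_t max_resident_blocks) {
    assert(max_resident_blocks > 0);
    if (smem_per_block == 0) {
        // The kernel uses no shared memory, so only the cap applies.
        return max_resident_blocks;
    }
    return std::min(smem_per_mp / smem_per_block, max_resident_blocks);
}

// Hypothetical helper: the largest per-block shared memory request that
// would still allow `target_blocks` blocks to be resident at once,
// under the same simplifying assumptions.
std::size_t smem_budget_per_block(std::size_t smem_per_mp,
                                  std::size_t target_blocks) {
    assert(target_blocks > 0);
    return smem_per_mp / target_blocks;
}
```

If I assume, purely for illustration, 16 KB (16384 bytes) of shared memory per MP and a cap of 8 resident blocks, then `blocks_per_mp_by_shared_mem(16384, 4096, 8)` gives 4, and `smem_budget_per_block(16384, 8)` gives 2048 bytes per block. What I would like to know is whether the hardware really behaves this way, so that the per-block shared memory request is effectively the knob for limiting concurrency.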
Appreciate any observations.