Thread Block Shape Versus Performance: Choosing the Proper Thread Block Shape

Is there a performance penalty if a 256-thread block is invoked as a 2-D 16x16 shape, a 4x64 block, or a 1-D 256x1 block?

Should the choice of thread block shape be dictated by factors such as the texture fetches, constant memory accesses, etc. being done in the kernel? Or is this choice simply application-centric and dictated by the algorithm?

In addition, if the grid has thousands of thread blocks, how can one determine how many blocks will run concurrently on a single MP? Mainly, I want to figure out how to divide the shared memory resources among concurrent blocks. Is there a way to limit the concurrency to 4 or 8 blocks per MP at a time, so each block gets a decent amount of shared memory?

Appreciate any observations.


Use the occupancy calculator to see how the block layout influences multiprocessor usage.


In short:

  • unless the “shape” of a thread block affects memory access patterns by half-warps, I don’t think it affects performance. You can always think of threads as having IDs from 0 to (size of the block - 1); multi-dimensional IDs are just a convenience for the programmer. Threads are grouped into half-warps based on their one-dimensional IDs.

  • each thread block gets its own shared memory allocation (other thread blocks cannot access it); the multiprocessor’s 16KB is divided among however many blocks are resident on it.

  • I don’t think you can make assumptions about which blocks get allocated to which multiprocessors, as that is done dynamically by the scheduler. You can probably count on the scheduler balancing the load.


Paulius, that is not fully correct. The warps seem to be filled along the “columns”, starting a fresh warp for every “row”. You can easily observe that a block layout like (1,n) runs slower than (n,1). I explain that by the warps never being filled in the first case. Do you have other info?


Interesting. No, I don’t have any other info. I’ve only tried (n,1) arrangements.


For the record, I saw Mark Harris talking about why they called it “warp”. The reason is the analogy to weaving: the warp is the horizontal thread, the weft is the vertical. So warps are aligned horizontally.


When I run a (1,256) block layout the profiler still gives me 100% occupancy, although it does run much slower than (256,1). If it did start a new warp for each row, that would be far more than the 24-warp limit per multiprocessor, and it probably wouldn’t run at all. I attribute the delay to the kernel accessing global memory in columns for the (1,256) case, instead of in rows for the (256,1) case.

I wanted to be sure of what I was saying so I just swapped the indices in my kernel and then (1,256) ran as fast as (256,1).

edit: Actually after looking closer it does run a little slower still (about 3%)