Why does it help to use more thread blocks?

arete · December 4, 2010, 9:27pm

The following kernel

__global__ void 

__launch_bounds__(NUMTHREADS, 1536/NUMTHREADS)

copystride(double *list, int n, double *copy){

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  int stride = blockDim.x*gridDim.x;

  for(int i=tid; i < n; i = i + stride)

    copy[i] = list[i]; 

}

is run with n = 563844000. With 384 threads per block and 56 blocks, the bandwidth is 90 GB/s on Tesla M2050 with error correction enabled. With 384 threads per block and 56000 blocks, the bandwidth is 100 GB/s. Tesla M2050 has 14 SMs and a limit of 3844 threads per SM. During any given cycle, the number of warps whose registers have been allocated on an SM will be 414*(384/32)=672 in either case. There is no spill to local memory.

My question is why does it help to have more thread blocks?

The CUDA documentation is sketchy at places, but on this point it is totally silent. The relevant part is Section 4.4 of NVIDIA best practices. It makes a case for having more warps that can be scheduled to hide latency to DRAM or latency due to dependencies in the instruction stream. That is totally understandable. My question is why does it help to have more thread blocks when the number of warps that can be scheduled in either scenario is the same.

I can understand having a lot of thread blocks to enable portability. Since the GPUs can have 14, 15, 16, or 30 SMs, having a large number of thread blocks will give good load balance on all the platforms. However, why does a large number of thread blocks improve bandwidth to memory?

Crankie · December 6, 2010, 6:04am

The following kernel
__global__ void 

__launch_bounds__(NUMTHREADS, 1536/NUMTHREADS)

copystride(double *list, int n, double *copy){

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  int stride = blockDim.x*gridDim.x;

  for(int i=tid; i < n; i = i + stride)

    copy[i] = list[i]; 

}
is run with n = 563844000. With 384 threads per block and 56 blocks, the bandwidth is 90 GB/s on Tesla M2050 with error correction enabled. With 384 threads per block and 56000 blocks, the bandwidth is 100 GB/s. Tesla M2050 has 14 SMs and a limit of 3844 threads per SM. During any given cycle, the number of active warps will be 414*(384/32)=672 in either case. There is no spill to local memory.

My question is why does it help to have more thread blocks?

The CUDA documentation is sketchy at places, but on this point it is totally silent. The relevant part is Section 4.4 of NVIDIA best practices. It makes a case for having more active warps to hide latency to DRAM or latency due to dependencies in the instruction stream. That is totally understandable. My question is why does it help to have more thread blocks when the number of active warps in either scenario is the same.

I can understand having a lot of thread blocks to enable portability. Since the GPUs can have 14, 15, 16, or 30 SMs, having a large number of thread blocks will give good load balance on all the platforms. However, why does a large number of thread blocks improve bandwidth to memory?

The warp scheduler can have many more warps to play with between warps waiting for data and warps ready to operate.

avidday · December 6, 2010, 7:49am

Is n constant in both cases? If so, isn’t this just Gustafson’s Law?

arete · December 6, 2010, 2:55pm

Crankie: I have edited my post. The earlier post used “active warps” imprecisely. The point is the number of thread blocks scheduled to one of the four SMs is 14*4 (14 SMs on M2050 and 4 thread blocks of size 384 on each SM). Therefore the number of warps that can potentially be scheduled during any given cycle is the same in the two cases.

avidday: n is the same for both cases.

avidday · December 6, 2010, 3:12pm

Minor brainfade…