Is it a good idea to set the block size to 32?

My application solves a large group of linear equations; each warp solves two systems. I created a 1D grid of 512 × 512 such blocks and it doesn't work. But if I create a 2D grid of 512 × 512 blocks, it produces correct results.

A 1D grid can only go up to 65535 blocks on pre-Kepler GPUs.
In your case 512 × 512 = 262,144 > 65535.

Tiny 32-thread blocks can result in SM/SMX underutilization on Fermi/Kepler.

Pre-Kepler devices can have a maximum of 8 blocks resident per multiprocessor. Kepler raises this to 16.

If your kernel uses ~60 registers, then Fermi and Kepler will be unable to fully occupy an SM/SMX with 8 or 16 resident 32-thread blocks, because Fermi's register file is large enough for 16 resident 63-register warps and Kepler's for 32 — twice what the block limit allows you to keep resident. If your kernel uses very few registers, the underutilization is even worse.

The rule of thumb I use is to launch blocks of at least two warps (64 threads) when working with high register-count kernels targeting Fermi or Kepler.

Anyway, benchmarking will reveal if this matters in your application.

Take a look at the “Technical Specifications per Compute Capability” table in the CUDA C Programming Guide to see what I’m talking about.

You are saying the more registers you use, the better the utilization?

I think that statement was meant relative to the occupancy that could have been achieved with larger blocks (or without the 8 or 16 blocks/SM limit).

No, that’s not what I’m saying. I jumped ahead a few steps in that final sentence. :)

I’m saying to be aware that a grid containing a large number of 32-thread “tinyblocks” will reach the “max resident blocks” multiprocessor limit before reaching the “max resident threads” limit or kernel resource limit.

A multiprocessor conga line will form in this situation.

The microbenchmark below illustrates my point.

Each line reports: name, multiProcessorCount, maxThreadsPerMultiProcessor, maxThreadsPerBlock, blocks, threads, elapsed time.

Each line has a workload that should “fill” a multiprocessor, because the initial block size is GCD(maxThreadsPerMultiProcessor, maxThreadsPerBlock) and the kernel is tiny. We should expect the elapsed times to be approximately the same until the max-resident-blocks-per-multiprocessor limit is hit.

If you inspect the Kepler and Fermi results you can see in their final lines that 32-thread tinyblocks are being delayed by the resident block limit.

The 9400 GT tests never reach the sm_11 8 block limit, so nothing interesting is revealed. The GT 240 result is curious.

The gist is here.