optimal block size

Any advice on the optimal block size?
I think that on G80 a small block size works better, as long as the block still has more than 3 warps. For example, 16x8 qualifies (small, and 4 warps). I think 16x8 suits most cases; I tried it on a few programs and it worked well. What is your experience?
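
For reference, here is a quick sketch of the warp arithmetic behind the "> 3 warps" rule. It assumes a warp size of 32 threads (true for G80); the helper name `warps_per_block` is made up for illustration and is not part of the CUDA API.

```cpp
#include <cassert>

// Warp size on G80-class hardware: 32 threads.
const int kWarpSize = 32;

// Hypothetical helper: number of warps occupied by a block_x x block_y block.
// A partially filled warp still takes a whole warp, so the division rounds up.
int warps_per_block(int block_x, int block_y) {
    assert(block_x > 0 && block_y > 0);
    int threads = block_x * block_y;
    return (threads + kWarpSize - 1) / kWarpSize;
}
```

With this, `warps_per_block(16, 8)` gives 4, so a 16x8 block just clears the 3-warp threshold, while an 8x8 block (2 warps) does not.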


It depends on many things. The more threads you have per block, the fewer blocks can run simultaneously, but OTOH the fewer threads you have, the worse memory latency hiding gets, and read-after-write dependencies start to show, etc. There’s a sweet spot, or maybe a “sweet range”. It’s usually somewhere between 128 and 256 threads per block and varies from problem to problem. Read http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf from page 80; there are some “heuristics” for finding the right number.

16x8=128, and that’s a little on the low side. You can run a lot of blocks, but you need at least 192 threads to completely hide read-after-write (RAW) dependencies, for example. You will need to experiment to see which works better for your kernels: more concurrent blocks or better RAW hiding.
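
A rough sketch of that trade-off in host-side C++, in case it helps with the experimenting. It assumes the G80 per-multiprocessor limits of 768 resident threads and 8 resident blocks, and it deliberately ignores register and shared memory usage, which often lower the block count for real kernels. It also reads the 192 figure as threads resident on one multiprocessor, which is an assumption about how the number applies. The names `SmLimits`, `resident_blocks`, `resident_threads` and `meets_raw_threshold` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits; the values below are the G80 figures.
struct SmLimits {
    int max_threads;  // resident threads per multiprocessor
    int max_blocks;   // resident blocks per multiprocessor
};

const SmLimits kG80 = {768, 8};

// Thread count quoted in the thread for hiding RAW dependencies.
const int kRawHidingThreads = 192;

// Blocks that fit on one multiprocessor, considering only the thread and
// block limits (registers and shared memory are NOT modelled here).
int resident_blocks(int threads_per_block, SmLimits sm) {
    assert(threads_per_block > 0);
    return std::min(sm.max_blocks, sm.max_threads / threads_per_block);
}

// Threads resident on one multiprocessor for a given block size.
int resident_threads(int threads_per_block, SmLimits sm) {
    return resident_blocks(threads_per_block, sm) * threads_per_block;
}

// Assumption: the 192-thread figure is compared against resident threads
// per multiprocessor rather than against the size of a single block.
bool meets_raw_threshold(int threads_per_block, SmLimits sm) {
    return resident_threads(threads_per_block, sm) >= kRawHidingThreads;
}
```

Under these assumptions a 128-thread block gives 6 resident blocks and 768 resident threads, a 256-thread block gives 3 blocks and 768 threads, and a 64-thread block is capped at 8 blocks, so only 512 threads. Treat the numbers as an upper bound: a kernel that uses many registers or a lot of shared memory will fit fewer blocks, which is why timing a few block sizes is still necessary.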