It depends on many things. The more threads per block you have, the fewer blocks can run simultaneously, but on the other hand, the fewer threads you have, the worse memory latency hiding gets and read-after-write (RAW) register dependencies start to show, etc… There's a sweet spot - or maybe a "sweet range". It's usually somewhere between 128 and 256 threads per block and varies between problems. Read http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf from page 80 onward; there are some heuristics for finding the right number.
16x8 = 128 threads, that's a little on the low side. You can run a lot of blocks concurrently, but for example you need at least 192 threads per block to completely hide RAW dependencies. You'll need to experiment to see which works better for your kernels - more concurrent blocks or better RAW hiding.
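To see the blocks-vs-threads tradeoff concretely, here's a small sketch (plain Python, not CUDA) that computes how many blocks fit on one multiprocessor for a few candidate block sizes. The per-SM limits used (768 threads/SM, 8 blocks/SM) are assumed G80-class values - plug in your own device's limits, and remember real kernels are also constrained by registers and shared memory.

```python
# Rough occupancy sketch: how many blocks fit per multiprocessor
# for a given threads-per-block choice.
# ASSUMPTION: G80-class per-SM limits (768 threads, 8 blocks);
# register and shared-memory pressure are ignored here.

MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8

def blocks_per_sm(threads_per_block):
    """Blocks that fit on one SM, limited by threads and block count."""
    if threads_per_block <= 0:
        return 0
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)

for tpb in (64, 128, 192, 256):
    b = blocks_per_sm(tpb)
    print(f"{tpb:4d} threads/block -> {b} blocks/SM, {b * tpb} active threads")
```

Note how 64 threads/block leaves the SM underfilled (the 8-blocks-per-SM limit bites: 8 × 64 = 512 of 768 threads), while 128, 192, and 256 all reach 768 active threads, just split into different numbers of concurrent blocks.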