I’m running a kernel with 16 threads per block. But since the warp size is 32 threads, shouldn’t performance increase if I use 32 or 64 threads per block?
However, when I increase the block size to 32 or 64, my performance is lower than with 16 threads per block. (My kernel runs over large arrays, so launching too few threads in total is not the problem.)
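For concreteness, here is a minimal sketch of the kind of launch configuration I mean (the kernel itself is just a placeholder, not my real code):

```
#include <cuda_runtime.h>

// Placeholder kernel, not my real one -- just to show the launch shape.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard the last, partially filled block
        data[i] *= 2.0f;
}

// threadsPerBlock is what I'm varying: 16, 32 or 64.
void launch(float *d_data, int n, int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}
```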
The number of threads per block influences speed in a variety of conflicting ways.
Yes, a warp is 32 threads. But if your bottleneck is divergent threads, you might get better performance from 16 threads per block, letting the multiprocessor interleave more blocks in parallel.
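To illustrate what “divergent threads” means here, consider the following hedged sketch (the kernel and the branch condition are invented for this example): within a single 32-thread warp, lanes that take different branches are serialized, so the warp effectively runs both paths back to back.

```
#include <cuda_runtime.h>
#include <math.h>

// Invented example: even and odd lanes of each warp take different branches,
// so the warp executes both paths one after the other (divergence).
__global__ void divergent(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)      // even lanes take one path...
        out[i] = sinf((float)i);
    else                           // ...odd lanes the other, serialized after it
        out[i] = cosf((float)i);
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    divergent<<<(n + 31) / 32, 32>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```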
So why not always use 16? Many reasons. Too few threads per block can become a bottleneck: if your kernel uses shared memory, each resident block needs its own allocation, which limits how many blocks a multiprocessor can run at once. And even without shared memory, a multiprocessor can only hold 8 blocks at a time, so 16-thread blocks cap you at 128 concurrent threads per multiprocessor.
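Here is a back-of-the-envelope sketch of those two limits; the numbers (16 KB of shared memory per multiprocessor, 4 KB used per block) are illustrative assumptions, not your kernel’s actual figures:

```
#include <algorithm>
#include <cstdio>

int main()
{
    const size_t smemPerSM    = 16 * 1024;  // shared memory per multiprocessor (assumed)
    const size_t smemPerBlock = 4 * 1024;   // shared memory one block uses (assumed)
    const int    hwBlockLimit = 8;          // resident-block cap per multiprocessor

    int bySmem   = (int)(smemPerSM / smemPerBlock);   // 4 blocks fit by shared memory
    int resident = std::min(bySmem, hwBlockLimit);    // hardware cap also applies

    printf("blocks resident per multiprocessor: %d\n", resident);
    // With 16 threads per block, that is only 4 * 16 = 64 threads in flight,
    // far fewer than the multiprocessor could otherwise keep busy.
    return 0;
}
```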
It gets complicated and unpredictable. The general rule is to experiment with block sizes. The profiler can help you identify your bottlenecks. More threads per block help with some bottlenecks (like scheduling starvation) and hurt others (like register pressure and occupancy). Add in the different schedulers on compute capability 1.0 and 1.2 devices, the relative latency of device memory on different GPUs, register count changes, the shared-memory overhead of kernel arguments… and your speeds become hard to predict.
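One simple way to experiment is to time the same kernel at several block sizes with CUDA events, along these lines (the kernel and array size are placeholders; substitute your own):

```
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel for the timing sweep.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 22;                  // placeholder array size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int sizes[] = {16, 32, 64, 128, 256};
    for (int tpb : sizes) {
        int blocks = (n + tpb - 1) / tpb;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        scaleKernel<<<blocks, tpb>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %.3f ms\n", tpb, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_data);
    return 0;
}
```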