The number of threads influences speed in a variety of conflicting ways.
Yes, a warp is 32 threads. But if your bottleneck is thread divergence, you might get more performance by using 16 threads per block and letting the multiprocessor run more blocks, interleaved.
So why not always use 16? Many reasons. If your blocks use shared memory, its per-multiprocessor capacity limits how many blocks can be resident at once, so too few threads per block wastes occupancy. And even without shared memory, a multiprocessor can only hold 8 resident blocks, so small blocks leave most of its thread slots empty.
It gets complicated and unpredictable. The general rule is to experiment with sizes; the profiler can help you find your bottlenecks. More threads help with some bottlenecks (like scheduling starvation) and hurt others (like register pressure, since every resident thread consumes registers). Add in the different schedulers on compute capability 1.0 and 1.2 devices, the relative latency of device memory on different GPUs, changes in register count, shared memory overhead for kernel arguments... and your speeds become genuinely hard to predict.