The threads are divided into small groups called ‘warps’ (32 threads each on NVIDIA hardware) that share an instruction counter. If at least one thread in a warp takes a particular path at the if-else, the whole warp has to step through that path, and the threads that did not take it sit idle until it ‘catches up’. I believe this is the case for all CUDA/OpenCL capable NVIDIA hardware. However, if all threads of the warp take the same branch, the unused instructions are skipped entirely. I cannot explain why you see increased runtimes for a single thread.
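To make the cost argument concrete, here is a rough sketch of that lockstep behaviour as a toy model in plain Python. It is not real GPU code and not a measurement: the function name `warp_branch_cost` and the instruction counts 10 and 40 are made up for illustration, and the model ignores everything except the rule that a warp pays for every branch side that at least one of its threads takes.

```python
WARP_SIZE = 32  # warp width on NVIDIA hardware


def warp_branch_cost(takes_if, if_cost, else_cost):
    """Instruction slots one warp spends on an if/else in this toy model.

    takes_if  -- one bool per thread in the warp (True = takes the `if` side)
    if_cost   -- hypothetical instruction count of the `if` body
    else_cost -- hypothetical instruction count of the `else` body
    """
    assert 0 < len(takes_if) <= WARP_SIZE
    cost = 0
    if any(takes_if):        # at least one thread needs the `if` body,
        cost += if_cost      # so the whole warp steps through it
    if not all(takes_if):    # at least one thread needs the `else` body
        cost += else_cost
    return cost


uniform = [True] * WARP_SIZE                     # every thread takes `if`
one_stray = [True] * (WARP_SIZE - 1) + [False]   # one thread takes `else`

print(warp_branch_cost(uniform, 10, 40))    # 10: the else body is skipped
print(warp_branch_cost(one_stray, 10, 40))  # 50: both bodies are executed
```

In this model a single stray thread makes the warp pay for both bodies, while a warp whose threads all agree pays only for the side it actually takes.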