Warps with threads that follow different branches are called “divergent warps”. Read the programming guide and you will know as much as any of us. Nobody except NVIDIA engineers really knows how the hardware handles the situation.
From my experimentation, I can add that “a little” divergence doesn’t really seem to change the performance, e.g. a simple if (2-way divergence) or threads that loop for different numbers of iterations. However the hardware handles these situations, it does a very good job.
What you are proposing sounds like full 32-way divergence in every single warp, which I doubt could possibly be handled efficiently.
“Parallel” means many things on the GPU. Don’t worry about warps or blocks at first when trying to wrap your head around a GPU algorithm; just start with the threads. GPUs implement a data-parallel paradigm, which means you perform exactly the same set of instructions on a large number of data elements (tens of thousands to millions or more). You need to imagine that EVERY SINGLE thread is being calculated at exactly the same time. If you can cast your algorithm into this form, it should work very well on the GPU.
Once you see it this way, all of the details of blocks and warps, the interleaved execution for memory-latency hiding, etc. become implementation details. They are sometimes important for performance reasons (especially memory access patterns), but the same basic picture remains: a massive number of independent, but identical, calculations being performed, each on different data values.