The question should be “how many branches can be executed in parallel per warp”. The answer is one. But the loss of performance depends heavily on the nature of the branches.
Let’s assume your block has 128 threads. That makes 4 warps of 32 threads each: threads 0-31 end up in warp 0, threads 32-63 in warp 1, and so on.
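To make that mapping concrete, here is a minimal host-side sketch, assuming a one-dimensional block and a warp size of 32. The names `kWarpSize`, `warp_of`, `lane_of` and `warps_in_block` are made up for illustration; inside a kernel, `tid` would typically be `threadIdx.x`.

```cpp
#include <cassert>

// Warp size assumed throughout this answer.
constexpr int kWarpSize = 32;

// Warp that a thread of a 1D block lands in (hypothetical helper).
int warp_of(int tid) {
    assert(tid >= 0);
    return tid / kWarpSize;
}

// Position of the thread inside its warp (hypothetical helper).
int lane_of(int tid) {
    assert(tid >= 0);
    return tid % kWarpSize;
}

// Number of warps a block occupies, rounding up for a partial last warp.
int warps_in_block(int block_size) {
    assert(block_size > 0);
    return (block_size + kWarpSize - 1) / kWarpSize;
}
```

With these helpers, `warps_in_block(128)` is 4, `warp_of(31)` is 0 and `warp_of(32)` is 1, which matches the split described above.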
When a warp is scheduled on a multiprocessor (MP), all of its scalar processors (SPs) execute the same single instruction (SIMD). Each SP takes care of four of the warp’s 32 threads, so at any given moment the MP is processing a single instruction from a single warp.
Now, if your branches are warp-aligned, meaning threads 0-31 take one path, 32-63 take another and so on, you get full performance and don’t suffer from branching at all. If thread 0 takes one branch and threads 1-31 take another, the two paths are serialized within that warp and you lose half of the performance: the warp takes 2T instead of T, assuming here that all branches take the same time T to execute. If thread 0 takes one path, thread 1 takes another and threads 2-31 take a third control path, the execution time for this warp will be 3T, and so on.
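The maximum penalty for completely divergent branches (i.e. every thread in a warp follows a different control path) is 32x, again assuming each path takes the same time T. In this pessimistic scenario, each warp of the block is run through 32 times, once per control path and with only one thread doing useful work each time, until every path has finished.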
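The T, 2T, 3T ... 32T reasoning can be written down as a small cost model. This is only a sketch of the simplified model used in this answer (every path costs the same T, and a warp pays once per distinct path its threads take), not a measurement of real hardware. The function name `block_cost_in_T` and the `path` vector are hypothetical: `path[i]` is just an id for the control path taken by thread `i`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Estimated cost of a block in units of T under the simplified model:
// each warp is charged one T for every distinct path among its threads.
// path[i] is the id of the control path taken by thread i (hypothetical input).
int block_cost_in_T(const std::vector<int>& path, int warp_size = 32) {
    assert(warp_size > 0);
    const std::size_t step = static_cast<std::size_t>(warp_size);
    int total = 0;
    for (std::size_t start = 0; start < path.size(); start += step) {
        // Threads [start, end) form one warp; the last warp may be partial.
        const std::size_t end = std::min(path.size(), start + step);
        std::set<int> distinct(path.begin() + start, path.begin() + end);
        total += static_cast<int>(distinct.size());
    }
    return total;
}
```

For a single warp, a `path` where only thread 0 differs from the rest gives 2, the "thread 0, thread 1, everyone else" case gives 3, and 32 different ids give 32. For a 128-thread block with `path[i] = i / 32` (warp-aligned branches) the result is 4, the same as for a block with no branching at all.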