It means that the operations are serialized, so you can no longer execute them in SIMD fashion. With 16 FPUs operating in SIMD (executing one warp of 32 threads in 2 clock cycles), I guess you would drop to about 1/16 of the performance if the lanes were completely serialized, and possibly as low as 1/32 in the worst case where all 32 threads of a warp take different paths. CUDA gives the impression that each thread is completely scalar, but in practice you often want to avoid letting the threads of a warp branch in different directions too much.
This is one of the downsides of SIMD, but it can often be alleviated, for example by using ternary operators. I also believe the compiler uses branch predication for short if/else statements, which should help.
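To make the ternary idea concrete, here is a minimal sketch. The kernel names (`scaleBranch`, `scaleSelect`), the data, and the launch configuration are all made up for illustration, and the compiler may well predicate the short if/else on its own and generate the same code for both versions, so treat it as an illustration of the pattern rather than a guaranteed speedup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branching form: neighbouring threads in a warp may take different sides
// of the if/else, in which case the two sides can be executed one after
// the other instead of in parallel.
__global__ void scaleBranch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] < 0.0f) {
        out[i] = 0.0f;
    } else {
        out[i] = in[i] * 2.0f;
    }
}

// Ternary form: every thread evaluates the same expression and only the
// selected value differs, which is easy to compile to a select or to
// predicated instructions.
__global__ void scaleSelect(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    out[i] = (x < 0.0f) ? 0.0f : x * 2.0f;
}

int main()
{
    const int n = 1024;  // hypothetical problem size
    float *in, *outA, *outB;
    cudaMallocManaged(&in,   n * sizeof(float));
    cudaMallocManaged(&outA, n * sizeof(float));
    cudaMallocManaged(&outB, n * sizeof(float));

    // Alternate the sign so that neighbouring threads disagree on the branch.
    for (int i = 0; i < n; ++i)
        in[i] = (i % 2 == 0) ? float(i) : -float(i);

    const int threads = 128;
    const int blocks  = (n + threads - 1) / threads;
    scaleBranch<<<blocks, threads>>>(in, outA, n);
    scaleSelect<<<blocks, threads>>>(in, outB, n);
    cudaDeviceSynchronize();

    // Both kernels compute the same result; only the control flow differs.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (outA[i] != outB[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(in);
    cudaFree(outA);
    cudaFree(outB);
    return 0;
}
```

If you want to see what the compiler actually did with each version, you can dump the generated PTX with `nvcc -ptx` and compare the two kernels.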