I am repeating a question I already asked in another thread, because it seems essential to me for understanding this multiprocessor stuff. It was a question about the overhead of conditionals, and someone answered that a warp that is masked out does not hurt performance. So if there is a conditional and all 32 threads in a warp do the same thing (the if-block or the else-block), then there should be no overhead?
If it is mixed, then the warp is serialized: it executes the if-block and the else-block in turn, with some threads masked out each time. But what does a kernel like this do:
if (warp % 2 == 0) doThis(); else doThat();
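To make the question concrete, here is a minimal sketch of the kind of kernel I mean. The names `parityKernel`, `runParityDemo` and the bodies of `doThis`/`doThat` are placeholders made up for illustration, and I am assuming a one-dimensional block, so that the warp index inside the block is `threadIdx.x / warpSize`. The branch condition is the same for all 32 threads of a warp, so no single warp contains a mix of both paths.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two code paths in the question.
__device__ int doThis(int x) { return 2 * x; }
__device__ int doThat(int x) { return x + 1; }

// Every thread picks its path from the parity of its warp index.
// Assumption: 1D blocks, so the warp index within the block is threadIdx.x / warpSize.
__global__ void parityKernel(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int warp = threadIdx.x / warpSize;
    if (warp % 2 == 0)
        out[tid] = doThis(tid);   // even warps
    else
        out[tid] = doThat(tid);   // odd warps
}

// Illustrative host helper: launches a single block of n threads
// (n must not exceed the maximum block size) and returns the results.
std::vector<int> runParityDemo(int n)
{
    std::vector<int> host(n, 0);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    parityKernel<<<1, n>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `runParityDemo(128)` runs four warps, two taking each path. The sketch only pins down the situation I am asking about; it says nothing about how the hardware schedules the two groups of warps.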
Is there an instruction pointer per warp, so that every warp can run different code of different blocks?
Or are the warps divided, so that it works like “serialisation light”: first doThis() is executed with all even warps while the odd warps are masked out (so they don't hurt performance), and once they have finished, the odd warps run the doThat() code? Or is the execution of doThis/doThat interleaved? What happens if there are syncthreads in doThis and doThat?
How deeply nested can conditional execution work? There must be some hardware for the masking of different threads for conditionals, for syncing, for global mem reads…
The main reason to try something like this is: if one has a problem that switches between different “modes” per thread (driven by the input data), you may get a lot of threads that operate in one mode for a while. This could produce a lot of serialisation, even if there is a good chance that all threads in a warp are in the same mode…
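One way I could at least measure how often warps end up mixed for a given input is sketched below. It is only a sketch under assumptions: it needs a toolkit that provides the `__shfl_sync` and `__all_sync` warp intrinsics (CUDA 9 or later), it uses 1D blocks whose size is a multiple of 32 so that every warp is full, and the names `countMixedWarps`, `measureMixedWarps` and the `modes` array are made up for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Counts the warps in which not all threads hold the same mode value.
// Assumption: blockDim.x is a multiple of 32, so every warp is full and the
// full mask 0xffffffff is valid for the warp intrinsics.
__global__ void countMixedWarps(const int *modes, int n, int *mixedWarps)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads past the end reuse the last element, so they never make a warp look mixed.
    int myMode = modes[min(tid, n - 1)];

    // Broadcast lane 0's mode to the whole warp and check that every lane agrees.
    int leaderMode = __shfl_sync(0xffffffffu, myMode, 0);
    int uniform = __all_sync(0xffffffffu, myMode == leaderMode);

    // One lane per warp reports the result.
    if (threadIdx.x % warpSize == 0 && !uniform)
        atomicAdd(mixedWarps, 1);
}

// Illustrative host helper: returns how many warps see more than one mode.
int measureMixedWarps(const std::vector<int> &modes)
{
    int n = static_cast<int>(modes.size());
    if (n == 0) return 0;

    const int threads = 128;                       // multiple of the warp size
    int blocks = (n + threads - 1) / threads;
    int *dModes = nullptr, *dCount = nullptr;
    int count = 0;

    cudaMalloc(&dModes, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dModes, modes.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    countMixedWarps<<<blocks, threads>>>(dModes, n, dCount);

    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dModes);
    cudaFree(dCount);
    return count;
}
```

If the count stays small compared to the total number of warps, most warps would take a single path through a per-mode conditional. It does not tell me what the cost is when different warps take different paths, which is the actual question.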
A lot of questions, but I find it hard to write efficient code for an architecture that I don’t understand.