I have an algorithm in which I compute a more or less complicated sum over a large dataset. I want to make my parallelism fine-grained, so that each thread computes one term of this sum and all the terms are then added together. Since I have to execute slightly different code for each term and each thread (some matrix-vector multiplications, sometimes with the inverse matrix), I need flow control.
So I thought I would just split my sum across warps, having each warp calculate the relevant terms on 32 data items. That way I should not have any divergent branches, since all threads within a warp execute the same instructions.
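To make the scheme concrete, here is a minimal sketch of what I mean. It is not my actual code: the kernel name `sum_terms`, the placeholder terms `term_a` and `term_b` (my real terms are matrix-vector products), the block size of 64, and the use of `atomicAdd` on floats for the final sum (which assumes compute capability 2.0 or newer) are all made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the real terms (which would be matrix-vector products).
__device__ float term_a(float x) { return 2.0f * x; }
__device__ float term_b(float x) { return x + 1.0f; }

// The branch depends only on the warp index, so all 32 threads of a warp
// take the same path and there is no divergence inside a warp.
__global__ void sum_terms(const float* data, float* out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // warp index within the block

    if (tid >= n) return;                // n is assumed to be a multiple of 32

    float v;
    if (warp % 2 == 0)
        v = term_a(data[tid]);           // even warps compute one kind of term
    else
        v = term_b(data[tid]);           // odd warps compute the other kind

    atomicAdd(out, v);                   // simple, not fast: add every term to one total
}

// Host helper: fills n items with 3.0f, runs the kernel, returns the sum.
float run_demo(int n)
{
    std::vector<float> h(n, 3.0f);
    float *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_data, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int block = 64;                // two warps per block
    sum_terms<<<(n + block - 1) / block, block>>>(d_data, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("sum = %f\n", run_demo(128));
    return 0;
}
```

In this sketch the only data-dependent branch is the bounds check, so as long as the item count is a multiple of the warp size, every warp follows a single path.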
But… I cannot have enough threads to hide register read-after-write hazards, and I need to sync after each warp. So the warps basically get executed serially, which is reflected in the ‘warp serialize’ number in the profiler, if I am correct.
My question, if someone can follow my approach: what is the impact of this warp serialization? I ran some small experiments: when I put two or more terms into a single warp, the execution time gets slightly larger, 5% or so. But I am unsure about the big picture. A clarification of the ‘warp serialize’ counter in the profiler would also be appreciated.