I have the same opinion.
I also ran an experiment on the effect of branching.
But the result doesn't match what I expected.
When the threads in a warp take different branches, those branches should be executed serially.
I launch only one thread block, with 32 threads in it.
Branch_test_kernel3 is the kernel without branches, and Branch_test_kernel4 is the one with branches, where each thread follows its own separate path.
The result is:
Kernel 3 start computing…
Kernel 3 Computing finish!
Kernel 3 Average processing time: 1.647360 (ms)
Kernel 4 start computing…
Kernel 4 Computing finish!
Kernel 4 Average processing time: 1.638650 (ms)
The two kernels take almost the same time.
Did I make a mistake, or is there something I overlooked?
Attached is the test code: the kernel and main.
Branch_test.cu.tar.gz (1.94 KB)
Branch_test_kernel.cu.tar.gz (944 Bytes)
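
For anyone who cannot open the attachments, the kind of test described above can be sketched roughly as follows. This is not the attached code: the kernel names uniform_kernel and divergent_kernel, the helper average_ms, the four-way switch (instead of one path per thread), and the values of iters and runs are all assumptions made for illustration. It launches one block of 32 threads and averages the time of each kernel with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two kernels described in the post,
// not the attached code. Both are launched as one block of 32 threads.

// All 32 threads execute the same loop, so the warp never diverges.
__global__ void uniform_kernel(float *out, int iters)
{
    float x = threadIdx.x + 1.0f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0001f + 0.5f;
    out[threadIdx.x] = x;  // store the result so the loop is not optimized away
}

// Threads pick one of four different loops based on their lane index,
// so the warp has four distinct paths to execute.
__global__ void divergent_kernel(float *out, int iters)
{
    float x = threadIdx.x + 1.0f;
    switch (threadIdx.x % 4) {
    case 0:  for (int i = 0; i < iters; ++i) x = x * 1.0001f + 0.5f; break;
    case 1:  for (int i = 0; i < iters; ++i) x = sqrtf(x + 2.0f);    break;
    case 2:  for (int i = 0; i < iters; ++i) x = 1.0f / (x + 1.0f);  break;
    default: for (int i = 0; i < iters; ++i) x = fabsf(x - 3.0f);    break;
    }
    out[threadIdx.x] = x;
}

// Average time in ms over `runs` launches, measured with CUDA events.
template <typename Kernel>
float average_ms(Kernel kernel, float *d_out, int iters, int runs)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<1, 32>>>(d_out, iters);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int r = 0; r < runs; ++r)
        kernel<<<1, 32>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / runs;
}

int main()
{
    const int iters = 1 << 16;  // loop length per thread (assumed value)
    const int runs  = 20;       // launches averaged per kernel (assumed value)

    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, 32 * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    printf("uniform   : %f ms\n", average_ms(uniform_kernel, d_out, iters, runs));
    printf("divergent : %f ms\n", average_ms(divergent_kernel, d_out, iters, runs));

    cudaFree(d_out);
    return 0;
}
```

In a sketch like this, the loop length matters: if iters is very small, both averages are likely to be dominated by launch overhead rather than by the work inside the kernel, and any cost from divergence may not be visible in the timings.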