Question about divergence and branch granularity

Can someone help me understand how the following example, which is used everywhere, goes from divergent code to non-divergent code? If there were 16 threads in a block, then in the first if statement only threads 0 and 1 would execute the body of the if statement. However, in the second if statement, it looks like all the threads would execute the body. How can this work if I only want selected threads in a block to copy something from device memory to shared memory? Also, is there an implicit barrier at the end of the if statement?

if (threadIdx.x < 2) {
   // do something
}

is the same as

if (threadIdx.x/WARP_SIZE < 2) {
   // do something
}


Threads are executed in groups of 32, called warps. In the first case, 2 threads of the warp take the if while the other 30 do not. This means the hardware serializes the two paths: the warp is issued once for the branch, with the 2 threads executing the body while the other 30 are masked off, and once for the path of the threads that do not take it (with no else, that second path is empty). After this the warp reconverges, and all of its threads again execute the same instruction at the same time.
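To make the first case visible, here is a minimal sketch that asks each warp which of its lanes satisfy `threadIdx.x < 2`. The kernel name `ballot_kernel` and the block size of 64 threads (two full warps) are assumptions chosen for illustration; `__ballot_sync` needs CUDA 9 or later. A warp whose mask is neither all zeros nor all ones would diverge on that condition.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp records which of its lanes satisfy
// the predicate threadIdx.x < 2.
__global__ void ballot_kernel(unsigned int *masks)
{
    bool takes_branch = threadIdx.x < 2;

    // Bit i of the result is set if lane i of this warp has takes_branch == true.
    unsigned int mask = __ballot_sync(0xffffffffu, takes_branch);

    // Lane 0 of each warp stores the mask for its warp.
    if (threadIdx.x % warpSize == 0) {
        masks[threadIdx.x / warpSize] = mask;
    }
}

int main()
{
    const int threads = 64;            // assumed block size: two full warps
    const int warps = threads / 32;
    unsigned int h_masks[warps];
    unsigned int *d_masks = nullptr;

    cudaMalloc(&d_masks, warps * sizeof(unsigned int));
    ballot_kernel<<<1, threads>>>(d_masks);
    cudaMemcpy(h_masks, d_masks, sizeof(h_masks), cudaMemcpyDeviceToHost);
    cudaFree(d_masks);

    for (int w = 0; w < warps; ++w) {
        // Mixed mask: some lanes take the branch and some do not.
        bool diverges = h_masks[w] != 0u && h_masks[w] != 0xffffffffu;
        printf("warp %d: mask 0x%08x, %s\n",
               w, h_masks[w], diverges ? "divergent" : "uniform");
    }
    return 0;
}
```

With this launch configuration, warp 0 should report mask 0x00000003 (only lanes 0 and 1 take the branch, so it is divergent) and warp 1 should report 0x00000000 (uniform).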

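In the second case the condition depends only on the warp index (threadIdx.x/WARP_SIZE is an integer division), so it has the same value for all 32 threads of a warp: every thread in the first 2 warps executes the if, and every thread in the remaining warps skips it. There is no branching inside any warp, so there is no divergence. With only 16 threads in a block, threadIdx.x/WARP_SIZE is 0 for all of them, which is why all the threads execute the body.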
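For the part of the question about letting only selected threads copy from device memory to shared memory, the first form is the one to use, and the small amount of divergence it causes is usually acceptable. The closing brace of an if is not a barrier, so a `__syncthreads()` placed after the if, where every thread of the block reaches it, is needed before other threads read the data. The following is only a sketch: the kernel name `selective_load`, the macros `BLOCK_SIZE` and `NUM_LOADERS`, and the sum written to `out` are assumptions made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16    // assumed: the 16-thread block from the question
#define NUM_LOADERS 2    // assumed: only threads 0 and 1 copy data

// Hypothetical kernel: two threads load into shared memory, all threads read.
__global__ void selective_load(const float *in, float *out)
{
    __shared__ float tile[NUM_LOADERS];

    // Only threads 0 and 1 take this branch; the other threads skip it.
    if (threadIdx.x < NUM_LOADERS) {
        tile[threadIdx.x] = in[threadIdx.x];
    }

    // The end of the if is not a barrier. This call sits outside the branch,
    // so every thread in the block reaches it.
    __syncthreads();

    // All threads can now safely read what the two loader threads wrote.
    out[threadIdx.x] = tile[0] + tile[1];
}

int main()
{
    float h_in[NUM_LOADERS] = {1.0f, 2.0f};
    float h_out[BLOCK_SIZE] = {0.0f};
    float *d_in = nullptr;
    float *d_out = nullptr;

    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    selective_load<<<1, BLOCK_SIZE>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int i = 0; i < BLOCK_SIZE; ++i) {
        printf("out[%d] = %.1f\n", i, h_out[i]);   // each element is 3.0
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Putting `__syncthreads()` inside the if would be a mistake here, because the threads that skip the branch would never reach it.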