Question about divergence and branch granularity

Can someone help me understand how the following example, which is used everywhere, goes from divergent code to non-divergent code? If there were 16 threads in a block, then in the first if statement only threads 0 and 1 would execute the body. However, in the second if statement, it looks like all the threads would execute the body. How can this work if I only want selected threads in a block to copy something from device memory to shared memory? Also, is there an implicit barrier at the end of the if statement?

if (threadIdx.x < 2) { 

}

is the same as

if (threadIdx.x/WARP_SIZE < 2) { 

   // do something

}

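The second if statement is probably not what you want. Under your assumption threadIdx.x ranges from 0 to 15, and integer division by WARP_SIZE (currently 32) gives 0 for every thread, which is of course less than 2. That is why all the threads execute the body of the if statement. The condition is non-divergent precisely because it evaluates to the same value for every thread in a warp, so it selects whole warps rather than individual threads.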
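To see the difference in granularity concretely, here is a minimal sketch that records which threads pass each test. The names record_predicates and count_selected are hypothetical, and WARP_SIZE is assumed to be 32 as stated above (device code can also read the built-in warpSize).

```cuda
#include <cuda_runtime.h>
#include <vector>

#define WARP_SIZE 32  // assumption: matches the warp size of the target GPU

// Each thread records which of the two predicates it satisfies.
__global__ void record_predicates(int *fine, int *coarse)
{
    int t = threadIdx.x;
    fine[t]   = (t < 2) ? 1 : 0;              // per-thread test: splits warp 0
    coarse[t] = (t / WARP_SIZE < 2) ? 1 : 0;  // per-warp test: uniform within a warp
}

// Hypothetical host helper: launches one block of block_size threads and
// counts how many threads passed each test. Returns false on a CUDA error.
bool count_selected(int block_size, int *n_fine, int *n_coarse)
{
    if (block_size < 1 || block_size > 1024) return false;
    size_t bytes = static_cast<size_t>(block_size) * sizeof(int);

    int *d_fine = nullptr, *d_coarse = nullptr;
    if (cudaMalloc(&d_fine, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_coarse, bytes) != cudaSuccess) {
        cudaFree(d_fine);
        return false;
    }

    record_predicates<<<1, block_size>>>(d_fine, d_coarse);

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<int> fine(block_size), coarse(block_size);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(fine.data(), d_fine, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(coarse.data(), d_coarse, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_fine);
    cudaFree(d_coarse);
    if (!ok) return false;

    *n_fine = 0;
    *n_coarse = 0;
    for (int t = 0; t < block_size; ++t) {
        *n_fine   += fine[t];
        *n_coarse += coarse[t];
    }
    return true;
}
```

Calling count_selected(16, &a, &b) should give a = 2 and b = 16, matching the explanation above. With a 128-thread block the per-warp test selects only the threads of warps 0 and 1, so b becomes 64 while a stays at 2.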

I suggest using bitwise or arithmetic operators (AND, OR, modulo) to pick out the selected threads, for instance with the following code:

if (threadIdx.x & 7) {

   // do something

}

In a 16-thread block, this lets every thread except 0 and 8 do something, because threadIdx.x & 7 is zero only when threadIdx.x is a multiple of 8.
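Coming back to the original goal of having only a few threads copy from device memory into shared memory, here is a minimal sketch under a few assumptions: threads 0 and 1 are the loaders, and the names selective_load, run_selective_load, N_LOADERS and staged are made up for illustration. As far as I know there is no implicit block-wide barrier at the end of an if statement, so the sketch calls __syncthreads() explicitly, and places it outside the if so that every thread in the block reaches it.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_LOADERS 2  // assumption: only threads 0 and 1 load from global memory

__global__ void selective_load(const float *in, float *out)
{
    __shared__ float staged[N_LOADERS];
    int t = threadIdx.x;

    // Per-thread condition: only the first N_LOADERS threads do the copy.
    if (t < N_LOADERS) {
        staged[t] = in[t];
    }

    // Explicit barrier, outside the if, so all threads of the block hit it.
    __syncthreads();

    // Now every thread can safely read what the loaders wrote.
    out[t] = staged[0] + staged[1];
}

// Hypothetical host helper: h_in must point to N_LOADERS floats.
// Returns false on a CUDA error or an unsupported block size.
bool run_selective_load(int block_size, const float *h_in,
                        std::vector<float> &h_out)
{
    if (block_size < N_LOADERS || block_size > 1024) return false;
    size_t in_bytes  = N_LOADERS * sizeof(float);
    size_t out_bytes = static_cast<size_t>(block_size) * sizeof(float);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, in_bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, out_bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    h_out.assign(block_size, 0.0f);
    bool ok = cudaMemcpy(d_in, h_in, in_bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        selective_load<<<1, block_size>>>(d_in, d_out);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out.data(), d_out, out_bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

With inputs 1.5 and 2.5 and a 16-thread block, every element of h_out should come back as 4.0. The if on threadIdx.x does diverge inside warp 0, but the divergent region is only a single load and store, which is usually an acceptable cost for this pattern.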