Is global memory latency still hidden by having larger thread blocks even if you have divergent branches?
Why wouldn’t it be? Also remember that a branch is only divergent if it splits a warp in two. (Whole warps in your block can go different ways, and there’s no effect on performance.)
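To make the warp-level point concrete, here is a minimal sketch of the two cases. The kernel names, the helper predicates, and the arithmetic on each path are all made up for illustration, and it assumes a 1D thread block, so that consecutive `threadIdx.x` values land in the same warp.

```cuda
#include <cuda_runtime.h>

// Hypothetical predicate: identical for every thread of a given warp
// (1D block assumed), so a warp never splits on it.
__host__ __device__ inline bool uniform_path(int tid, int warp_size)
{
    return (tid / warp_size) % 2 == 0;
}

// Hypothetical predicate: alternates between neighbouring threads,
// so every warp is split in two.
__host__ __device__ inline bool per_thread_path(int tid)
{
    return tid % 2 == 0;
}

// Even warps take one path, odd warps take the other.
// Warps differ from each other, but no single warp diverges.
__global__ void warp_uniform_branch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // bounds check; can only split the last warp

    if (uniform_path(threadIdx.x, warpSize))
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

// Even and odd threads of the same warp take different paths,
// so the warp has to execute both paths: this is real divergence.
__global__ void intra_warp_branch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (per_thread_path(threadIdx.x))
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}
```

Both kernels would be launched the same way, e.g. `warp_uniform_branch<<<blocks, 256>>>(d_in, d_out, n);` with `d_in` and `d_out` being device buffers you allocated yourself. Only the second one splits its warps.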