We know that divergence within a warp degrades performance. What about the warps inside a thread block? If one warp runs longer than the others, then the shared memory allocated to the thread block will not be released until that warp has finished, right?
Another question: is the program executed sequentially within a warp? Can I get an early return? I want to check whether an array contains a non-zero element and, if so, return true. If an early return is possible, the remaining threads would not need to check their elements.
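
To make the second question concrete, here is a rough sketch of the kind of kernel I have in mind. The names `containsNonZero` and `arrayHasNonZero`, the global `found` flag, and the 64 x 256 launch configuration are all placeholders I made up for illustration, and error checking is left out. I am not claiming this is the right way to do it.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: sets *found to 1 if any element of data is non-zero.
// Each thread walks the array with a grid-stride loop and returns as soon as
// it sees the flag set or finds a non-zero element itself.
__global__ void containsNonZero(const int *data, int n, volatile int *found)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (*found) {
            return;      // some other thread already found a non-zero element
        }
        if (data[i] != 0) {
            *found = 1;  // every writer stores the same value, so the race is harmless
            return;      // only this thread returns here
        }
    }
}

// Hypothetical host wrapper: copies the array to the device, runs the kernel,
// and reads the flag back. Error checking is omitted for brevity.
bool arrayHasNonZero(const int *h_data, int n)
{
    int *d_data = nullptr;
    int *d_found = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_found, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_found, 0, sizeof(int));

    containsNonZero<<<64, 256>>>(d_data, n, d_found);

    int h_found = 0;
    // This copy runs on the default stream after the kernel and blocks the host.
    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_found);
    return h_found != 0;
}
```

Calling `arrayHasNonZero(hostArray, n)` from the host would then give the true/false answer. What I do not know is whether the `return` inside the loop actually saves any work when the other threads of the same warp are still running.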