Inter-warp vs. intra-warp

We know that divergence within a warp degrades performance. What about the warps inside a thread block? If one warp runs longer than the others, then the shared memory of the thread block will not be released until that one warp has finished, right?

Another question: is the program executed sequentially within a warp? Can I get an early return? I want to check whether an array contains a non-zero element and, if so, return true. If an early return is possible, the remaining threads need not check.

I’m pretty sure you’re correct on your first question: the resources consumed by a block won’t be released until the entire block has completed, or at least that’s my understanding of it.

As for your second question, instructions are dispatched on a per-warp basis, so the entire warp will execute the check at the same time. Even at the block level, ensuring an early return could be tricky. You could write a flag to shared memory on finding a non-zero element, which each warp could read before attempting its own check, but which warps would see the updated value, and when, is not defined. Acceleware gave a talk at GTC containing some details about these kinds of warp scheduling problems: http://nvidia.fullviewmedia.com/gtc2013/0318-212A-S3453.html
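To make the per-warp point concrete, here is a rough sketch (not something from the talk) in which each warp agrees on the result with a single vote instruction. The kernel name `warpVoteNonZero`, the host helper `warpVoteContainsNonZero`, and the launch configuration are made up for illustration. It assumes a 1-D block whose size is a multiple of 32 so that every warp is full, and it omits CUDA error checking.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: every warp votes on whether any of its lanes saw a
// non-zero value. Assumes blockDim.x is a multiple of 32 (full warps only).
__global__ void warpVoteNonZero(const int *data, size_t n, int *result)
{
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    // `base` is identical for all threads of a block, so all lanes of a warp
    // run the same iterations and reach __any_sync together.
    for (size_t base = (size_t)blockIdx.x * blockDim.x; base < n; base += stride) {
        const size_t i = base + threadIdx.x;
        const int pred = (i < n) && (data[i] != 0);
        if (__any_sync(0xffffffffu, pred)) {       // true for the whole warp
            if ((threadIdx.x & 31) == 0)           // one lane reports the hit
                atomicOr(result, 1);
            break;                                 // warp-uniform early exit
        }
    }
}

// Hypothetical host helper; error checking left out to keep the sketch short.
bool warpVoteContainsNonZero(const int *hostData, size_t n)
{
    assert(hostData != nullptr || n == 0);
    int *dData = nullptr, *dResult = nullptr;
    int hResult = 0;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dResult, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(int));
    warpVoteNonZero<<<4, 128>>>(dData, n, dResult); // 128 = 4 full warps
    cudaMemcpy(&hResult, dResult, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return hResult != 0;
}
```

Note that this only stops the warp that found the element; other warps keep scanning until they hit something themselves or run out of data.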

Coordinating the early termination of a block is relatively easy because there are many synchronization options available for threads in the same block. In the case you describe of finding a non-zero element, having a shared memory termination flag makes sense, though make sure you execute __threadfence_block() after the assignment to ensure the write is visible to other threads.
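For what it's worth, the shared-memory flag approach might look roughly like the sketch below. The names `containsNonZero` and `arrayContainsNonZero` and the launch configuration are my own assumptions, not an established API, and CUDA error checking is left out. Threads poll the flag before each check, so how quickly the other warps notice it is still up to the scheduler, but the result is correct either way.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: threads of a block stop scanning once any thread in
// the same block has set the shared termination flag.
__global__ void containsNonZero(const int *data, size_t n, int *result)
{
    __shared__ volatile int found;        // block-wide termination flag
    if (threadIdx.x == 0) found = 0;
    __syncthreads();                      // flag is initialised for everyone

    const size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (found) break;                 // someone in this block already succeeded
        if (data[i] != 0) {
            found = 1;                    // all writers store the same value
            __threadfence_block();        // make the write visible to the block
            break;
        }
    }

    __syncthreads();                      // break leaves the loop, not the kernel,
                                          // so every thread reaches this barrier
    if (threadIdx.x == 0 && found)
        atomicOr(result, 1);              // one report per block
}

// Hypothetical host helper; error checking left out to keep the sketch short.
bool arrayContainsNonZero(const int *hostData, size_t n)
{
    assert(hostData != nullptr || n == 0);
    int *dData = nullptr, *dResult = nullptr;
    int hResult = 0;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dResult, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(int));
    containsNonZero<<<64, 256>>>(dData, n, dResult);
    cudaMemcpy(&hResult, dResult, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return hResult != 0;
}
```

The flag only terminates the block that found the element. Stopping other blocks early would need a flag in global memory, which this sketch does not attempt.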