I found a bug in the code example in this link:
Instead of "global_idx":
// Computation overlapped with the memcpy_async of the "copy" stage:
compute(global_out + global_idx, shared + shared_offset[compute_stage_idx]);
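we should use "block_batch(batch - 1)", because the compute stage holds the data that was fetched for the previous batch, not the one currently being copied: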
compute(global_out + block_batch(batch - 1), shared + shared_offset[compute_stage_idx]);
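
To make the mismatch easier to see, here is a minimal host-side sketch that models only the output indexing of the two-stage loop. It is not the kernel itself. It assumes the example's structure is: a prologue that fetches batch 0, a loop over batch = 1 .. batch_sz - 1 that copies "batch" while computing "batch - 1", and an epilogue that computes the last batch. The function name output_writes and the use_fix flag are hypothetical, and block_batch(b) is simplified to just b (a single block, offsets counted in whole batches).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: counts how many times each output batch slot is written
// by compute() for one block. Only the indexing is modelled, no real copies.
std::vector<int> output_writes(std::size_t batch_sz, bool use_fix) {
    std::vector<int> writes(batch_sz, 0);
    if (batch_sz == 0) return writes;

    // Prologue (assumed): batch 0 is fetched into stage 0, nothing is written yet.
    for (std::size_t batch = 1; batch < batch_sz; ++batch) {
        // The copy stage fetches `batch`; the compute stage holds `batch - 1`.
        std::size_t global_idx = batch;  // stands in for block_batch(batch)
        std::size_t out_idx = use_fix ? batch - 1 : global_idx;
        assert(out_idx < batch_sz);
        ++writes[out_idx];
    }

    // Epilogue (assumed): compute the batch fetched by the last iteration.
    ++writes[batch_sz - 1];
    return writes;
}
```

Under these assumptions, output_writes(4, false) gives {0, 1, 1, 2}: the slot for batch 0 is never written and the last slot is written twice. With the proposed change, output_writes(4, true) gives {1, 1, 1, 1}, so every batch slot is written exactly once.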