A bug in the staging pipeline code example in the official CUDA documentation

I found a bug in the code example at this link:

Instead of passing “global_idx” as the output offset:

// Computation overlapped with the memcpy_async of the "copy" stage:
compute(global_out + global_idx, shared + shared_offset[compute_stage_idx]);

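we should use “block_batch(batch - 1)”, because the compute stage works on the batch fetched in the previous iteration, while “global_idx” is the offset of the batch being copied in the current one:

compute(global_out + block_batch(batch - 1), shared + shared_offset[compute_stage_idx]);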
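
To make the off-by-one visible without a GPU, here is a minimal host-side C++ sketch of the loop for a single block. It is not the code from the documentation. The function name `run_pipeline`, the `use_fix` flag, the synchronous copies, the doubling compute step, and the `-1` sentinel are all assumptions made for illustration. The loop structure around the `compute` call (a priming copy, `compute_stage_idx = (batch - 1) % 2`, and a final drain step) is how I understand the example, so treat that as an assumption too.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (not CUDA) of the two-stage copy/compute loop for one block.
// Assumptions for illustration: a single block (so grid.size() == block_size),
// synchronous copies, and a compute step that doubles each element.
// Output slots that are never written keep the sentinel value -1.
std::vector<int> run_pipeline(const std::vector<int>& in, std::size_t block_size,
                              std::size_t batch_sz, bool use_fix) {
    assert(in.size() == block_size * batch_sz);
    std::vector<int> out(in.size(), -1);
    if (batch_sz == 0) return out;

    // Two staging buffers stand in for the two shared-memory stages.
    std::vector<std::vector<int>> stage(2, std::vector<int>(block_size));

    // Offset of batch `batch` in global memory for block 0.
    auto block_batch = [&](std::size_t batch) { return block_size * batch; };

    auto copy = [&](std::size_t s, std::size_t offset) {
        for (std::size_t i = 0; i < block_size; ++i) stage[s][i] = in[offset + i];
    };
    auto compute = [&](std::size_t offset, std::size_t s) {
        for (std::size_t i = 0; i < block_size; ++i) out[offset + i] = 2 * stage[s][i];
    };

    copy(0, block_batch(0));  // prime the first stage
    for (std::size_t batch = 1; batch < batch_sz; ++batch) {
        std::size_t compute_stage_idx = (batch - 1) % 2;
        std::size_t copy_stage_idx = batch % 2;
        std::size_t global_idx = block_batch(batch);  // offset of the batch being copied

        copy(copy_stage_idx, global_idx);
        // The compute stage holds batch - 1, so its output belongs at block_batch(batch - 1).
        compute(use_fix ? block_batch(batch - 1) : global_idx, compute_stage_idx);
    }
    // Drain: compute the batch fetched by the last iteration.
    compute(block_batch(batch_sz - 1), (batch_sz - 1) % 2);
    return out;
}
```

With 3 batches of 4 elements and input 0..11, calling `run_pipeline(in, 4, 3, false)` leaves the output slot of batch 0 unwritten and puts the results of batch 0 into the slot of batch 1. Calling it with `use_fix = true` gives `out[i] == 2 * in[i]` for every element.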