A question about the cudaTensorCoreGemm.cu sample

In cudaTensorCoreGemm.cu, the loop below copies 16 bytes at once in each lane for matrices A and B:

  #pragma unroll
  for (int i = 0; i < ((WARP_SIZE / 2) / CHUNK_COPY_LINES_PER_WARP) * 2; i++) {
    // Copy 16 bytes at once in each lane.
    *((int4 *)&shmem[shmem_idx][0] + (laneId % CHUNK_COPY_LINE_LANES)) =
        *lane_ptr;

    // Advance the global memory pointer and the shared memory index.
    lane_ptr =
        (int4 *)((half *)lane_ptr + K_GLOBAL * CHUNK_COPY_LINES_PER_WARP);
    shmem_idx += CHUNK_COPY_LINES_PER_WARP;
  }

My question is about the loop bound ((WARP_SIZE / 2) / CHUNK_COPY_LINES_PER_WARP) * 2. It evaluates to 8, which is the correct iteration count, but I don't understand what the expression actually means. As I understand it, the loop bound should express that each warp is responsible for copying 2 * 16 = 32 rows of data. Assume the chunk size (CHUNK_K) is 4. A warp has 32 lanes, and one chunk row needs 8 lanes to copy, so the warp can copy only 32 / 8 = 4 rows per iteration. That gives 2 * 16 / 4 = 8 iterations, the same value as the sample's expression. Written as a formula, my reasoning is 2 * N / CHUNK_COPY_LINES_PER_WARP, where N is the tile dimension (16 here) and the factor 2 means that each warp handles two tile rows.
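To make the arithmetic concrete, here is a small host-only sketch of the numbers as I understand them. The constants mirror my reading of the sample's macros and may not match it exactly; the chunk size of 4 is the assumption stated above; HALF_BYTES and INT4_BYTES stand in for sizeof(half) and sizeof(int4); and sample_iterations and my_iterations are names I made up for illustration.

```cpp
// Sketch only: values are assumptions based on my reading of the sample.
constexpr int WARP_SIZE = 32;
constexpr int N = 16;           // tile dimension
constexpr int K = 16;           // tile dimension along K
constexpr int CHUNK_K = 4;      // assumed chunk size
constexpr int HALF_BYTES = 2;   // stand-in for sizeof(half)
constexpr int INT4_BYTES = 16;  // stand-in for sizeof(int4)

// Bytes in one chunk row, and bytes one warp copies per iteration.
constexpr int CHUNK_LINE_BYTES = CHUNK_K * K * HALF_BYTES;   // 128
constexpr int WARP_COPY_BYTES = WARP_SIZE * INT4_BYTES;      // 512

// Rows a warp copies per iteration, and lanes needed per row.
constexpr int CHUNK_COPY_LINES_PER_WARP = WARP_COPY_BYTES / CHUNK_LINE_BYTES;  // 4
constexpr int CHUNK_COPY_LINE_LANES = WARP_SIZE / CHUNK_COPY_LINES_PER_WARP;   // 8

// Loop bound as written in the sample.
constexpr int sample_iterations =
    ((WARP_SIZE / 2) / CHUNK_COPY_LINES_PER_WARP) * 2;

// Loop bound as I would have written it: two tile rows of N lines each.
constexpr int my_iterations = 2 * N / CHUNK_COPY_LINES_PER_WARP;

static_assert(sample_iterations == 8, "sample expression gives 8");
static_assert(my_iterations == 8, "my expression gives 8");
```

Both expressions give 8 with these values. They coincide because WARP_SIZE / 2 and N are both 16 here, which is why the numbers alone don't tell me which meaning the sample intends.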
So, can anyone explain what the original loop condition in the sample actually means? Thank you!