cp.async causes Uncoalesced Shared Accesses and many L1 Wavefronts Shared Excessive

Hi, I am working on a kernel that uses cp.async as part of a 4-stage pipeline.
Nsight Compute profiling shows that these cp.async instructions cause a significant amount of L1 Wavefronts Shared Excessive (see picture below). However, the memory table shows no bank conflicts due to cp.async. Granted, the loads from global memory are uncoalesced and not optimal, but I would expect that to show up as L2 Global Excessive instead.
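To make the "uncoalesced global load" part concrete, here is a rough host-side sketch of how one could count the 32-byte sectors a warp touches. It is only a simplified model, not how Nsight Compute derives its metrics. The names `strided_addresses` and `count_sectors` are hypothetical helpers, and the fixed per-lane stride is an assumption standing in for the real index expression, which depends on `filt_c` and `c_offset`.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Byte addresses for one warp where lane i starts at i * stride_bytes.
// Purely illustrative: the real kernel's addresses depend on filt_c and c_offset.
std::vector<std::uint64_t> strided_addresses(std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(lane * stride_bytes);
    }
    return addrs;
}

// Number of distinct 32-byte sectors touched when each lane reads
// bytes_per_lane bytes starting at its own address.
int count_sectors(const std::vector<std::uint64_t>& addrs, int bytes_per_lane) {
    assert(bytes_per_lane > 0);
    std::set<std::uint64_t> sectors;
    for (std::uint64_t a : addrs) {
        std::uint64_t first = a / 32;
        std::uint64_t last = (a + bytes_per_lane - 1) / 32;
        for (std::uint64_t s = first; s <= last; ++s) {
            sectors.insert(s);
        }
    }
    return static_cast<int>(sectors.size());
}
```

With 16 bytes per lane, `count_sectors(strided_addresses(16), 16)` gives 16 sectors (fully contiguous), while `count_sectors(strided_addresses(256), 16)` gives 32, i.e. twice as many sectors for the same amount of useful data. That is the kind of excess I would expect to be attributed to global memory rather than to shared memory.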

Here is part of the code doing cp.async:

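  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Each thread issues two 16-byte asynchronous copies (k = 0, 1).
  for(int k = 0; k < 2; k++){

    // Destination in shared memory (still a generic address here).
    void *ptr = (void *)(smem + k*8*(BC*16) + ty*(BC*16) + tx * 8);

    unsigned int smem_ptr;

    // Convert the generic address to a 32-bit shared-state-space address.
    asm("{ .reg .u64 smem_ptr; cvta.to.shared.u64 smem_ptr, %1; cvt.u32.u64 "
        "%0, smem_ptr; }\n"
        : "=r"(smem_ptr)
        : "l"(ptr));

    // Copy 16 bytes from global to shared memory; .cg caches in L2 only.
    asm volatile("cp.async.cg.shared.global [%0], [%1], %2;\n" ::"r"(smem_ptr),
                "l"(&pInputs[c_tensor + k * 2 * (filt_c<<2) + (ty/4) * (filt_c<<2) + (ty % 4) * filt_c 
                          + (tx/2)*c_offset + (tx%2)*8 + ko * 16 * c_offset]),
                "n"(16));
  }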
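
To check whether the shared-memory destination addresses alone could explain the excess, the `smem` offset expression above can be modeled on the host. The following is a minimal sketch under stated assumptions, not the profiler's actual metric: it assumes 2-byte elements (consistent with `tx * 8` elements per 16-byte copy, but not stated above), 32 banks of 4 bytes each, and threads linearized as `ty * block_dim_x + tx`. `smem_byte_offset`, `min_wavefronts` and `block_dim_x` are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

// Byte offset into shared memory written by thread (tx, ty) at loop step k,
// mirroring smem + k*8*(BC*16) + ty*(BC*16) + tx*8 from the kernel.
// elem_bytes = 2 is an assumption (e.g. a 16-bit type); the post does not state it.
long smem_byte_offset(int tx, int ty, int k, int BC, int elem_bytes = 2) {
    long elems = 8L * k * (BC * 16) + static_cast<long>(ty) * (BC * 16) + 8L * tx;
    return elems * elem_bytes;
}

// Simplified lower bound on wavefronts for one warp issuing 16-byte stores:
// the largest number of distinct 4-byte words that map to any single bank.
int min_wavefronts(int warp, int k, int BC, int block_dim_x) {
    assert(block_dim_x > 0 && BC > 0);
    std::array<std::set<long>, 32> words_per_bank;
    for (int lane = 0; lane < 32; ++lane) {
        int tid = warp * 32 + lane;      // assumed thread linearization
        int tx = tid % block_dim_x;
        int ty = tid / block_dim_x;
        long base = smem_byte_offset(tx, ty, k, BC);
        for (int b = 0; b < 16; b += 4) { // four 4-byte words per 16-byte copy
            long word = (base + b) / 4;
            words_per_bank[word % 32].insert(word);
        }
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

Under this model, `min_wavefronts(0, 0, 16, 32)` returns 4, since the warp writes 512 contiguous bytes and each bank receives exactly four words. A degenerate layout such as `min_wavefronts(0, 0, 4, 1)` returns 32, because every lane lands in banks 0 to 3. Plugging in the real `BC` and block dimensions would show whether the destination layout is already at the 4-wavefront floor, in which case the excess reported by the profiler would have to come from somewhere other than classic bank conflicts.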