Why does L2045 show a long scoreboard stall?

I am using an RTX 3080 with CUDA 11.8.

L2041 will call:

I have 3 questions:

  1. How does LDGDEPBAR work? I am using LDGSTS.E.BYPASS, not LDG. Can I remove the LDGDEPBAR instruction?
  2. How does WARPSYNC (on L3796) work?
  3. Why does L2045 spend so long in the long scoreboard stall?

Thanks!

Is it not the case that LDGSTS.E.BYPASS is a variant of LDG that transfers data directly to shared memory? If so, that suggests LDGDEPBAR is there for a reason. Why do you want to remove it?

I may well be misunderstanding the situation, but as this appears to involve a global->shared transfer, you could be seeing the impact outlined here:

"For compute capability 8.x, the pipeline mechanism is shared among CUDA threads in the same CUDA warp. This sharing causes batches of memcpy_async to be entangled within a warp, which can impact performance under certain circumstances."
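
To make the copy / commit / wait relationship concrete, here is a minimal sketch using the pipeline primitives from `<cuda_pipeline.h>`. The kernel name `stage_copy`, the helper `run_copy_check`, and the tile size are made up for illustration and are not taken from the original code. My understanding, which is an assumption you should verify with `cuobjdump -sass` on your own build, is that on compute capability 8.x a 16-byte `__pipeline_memcpy_async` typically becomes an LDGSTS, `__pipeline_commit` becomes LDGDEPBAR, and `__pipeline_wait_prior` becomes a DEPBAR. If that holds, LDGDEPBAR is what closes a batch of async copies so a later wait can track it, and removing it would leave the wait with nothing to wait on.

```cuda
#include <cuda_runtime.h>
#include <cuda_pipeline.h>
#include <vector>

constexpr int kThreads = 128;  // hypothetical tile size: one float4 per thread

// Each thread stages one 16-byte chunk from global to shared memory
// asynchronously, then reads a neighbour's chunk back out of shared memory.
__global__ void stage_copy(const float4* __restrict__ in, float4* __restrict__ out) {
    __shared__ float4 tile[kThreads];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Issue the async global->shared copy (assumed to map to LDGSTS on sm_8x).
    __pipeline_memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float4));
    // Close the current batch of copies (assumed to map to LDGDEPBAR).
    __pipeline_commit();
    // Block until all committed batches of this thread have landed.
    __pipeline_wait_prior(0);
    // Make every thread's data visible to the rest of the block.
    __syncthreads();

    out[i] = tile[(threadIdx.x + 1) % kThreads];
}

// Host-side check: runs the kernel and verifies the rotated copy.
bool run_copy_check() {
    const int blocks = 4;
    const int n = blocks * kThreads;
    std::vector<float4> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) {
        h_in[i] = make_float4((float)i, 0.0f, 0.0f, 0.0f);
    }

    float4 *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float4)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float4)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    stage_copy<<<blocks, kThreads>>>(d_in, d_out);
    bool ok = (cudaGetLastError() == cudaSuccess);
    // This blocking copy also waits for the kernel to finish.
    ok = ok && (cudaMemcpy(h_out.data(), d_out, n * sizeof(float4),
                           cudaMemcpyDeviceToHost) == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) {
        const int block = i / kThreads;
        const int t = i % kThreads;
        const int src = block * kThreads + (t + 1) % kThreads;
        ok = (h_out[i].x == h_in[src].x);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Without the commit and wait, the read from `tile` could race with the in-flight copy. The wait is also where I would expect the stall to be attributed in a profile, since that is the point where the warp actually blocks on the outstanding global loads.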