Hi, while reading the CUDA C Best Practices Guide I found a comment saying that access to a register may be delayed by read-after-write dependencies and register memory bank conflicts. Registers are owned by the thread that allocates them, so a read-after-write dependency can only be caused by instructions of that same thread. But register access costs zero extra clock cycles, so how can such a delay happen? Also, what are the specific details of register memory bank conflicts? Thanks
Accessing a register does not consume any extra clock cycles, but a value written to a register can’t be read back for 16 to 24 cycles. In the meantime, the device can execute instructions from other threads, or even from the same thread, as long as they do not depend on the recently written registers.
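To make the "same thread" case concrete, here is a minimal sketch of my own (not taken from the guide). The kernel names `dependent_chain` and `independent_chains` and the host helper `run_both` are hypothetical, and the values `a = b = 1` are chosen only so the results are exact and easy to check. In the first kernel every multiply-add reads the register written by the previous one, so within a thread there is nothing else to issue while that result is pending. In the second kernel the four accumulators never read each other, so the hardware is free to overlap them. How much this matters in practice depends on the device, on what the compiler does with the loop, and on how many other warps are available to fill the gap.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One long chain: each multiply-add reads the x written by the previous one,
// so every instruction has a read-after-write dependency on its predecessor.
__global__ void dependent_chain(const float* in, float* out,
                                float a, float b, int steps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    for (int k = 0; k < steps; ++k)
        x = x * a + b;
    out[i] = x;
}

// Four chains that never read each other's registers. While x0 is still
// pending, the multiply-adds on x1, x2 and x3 have no dependency on it.
__global__ void independent_chains(const float* in, float* out,
                                   float a, float b, int steps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x0 = in[i];
    float x1 = x0 + 1.0f;   // distinct start values keep the chains separate
    float x2 = x0 + 2.0f;
    float x3 = x0 + 3.0f;
    for (int k = 0; k < steps / 4; ++k) {
        x0 = x0 * a + b;
        x1 = x1 * a + b;
        x2 = x2 * a + b;
        x3 = x3 * a + b;
    }
    out[i] = x0 + x1 + x2 + x3;
}

// Hypothetical host helper: runs both kernels on n elements and returns true
// if both produced the values expected for a = b = 1 and in[i] = i.
bool run_both(int n)
{
    const int steps = 64;   // multiple of 4: 64 multiply-adds per thread in both kernels
    const float a = 1.0f, b = 1.0f;
    std::vector<float> h_in(n), h_dep(n), h_ind(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_dep = nullptr, *d_ind = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_dep, bytes);
    cudaMalloc(&d_ind, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 128;
    const int grid = (n + block - 1) / block;
    dependent_chain<<<grid, block>>>(d_in, d_dep, a, b, steps, n);
    independent_chains<<<grid, block>>>(d_in, d_ind, a, b, steps, n);

    // A blocking cudaMemcpy on the default stream waits for the kernels above.
    cudaError_t e1 = cudaMemcpy(h_dep.data(), d_dep, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_ind.data(), d_ind, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_dep);
    cudaFree(d_ind);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        // With a = b = 1: one chain of 64 steps gives i + 64;
        // four chains of 16 steps give (i+16)+(i+17)+(i+18)+(i+19) = 4i + 70.
        if (h_dep[i] != i + 64.0f) return false;
        if (h_ind[i] != 4.0f * i + 70.0f) return false;
    }
    return true;
}
```

Calling `run_both(256)` from a `main` only checks that both kernels compute what they should. To see whether the dependency chain actually costs anything on a given card, you would time the two kernels with a much larger `steps` and compare, keeping in mind that with enough resident warps the scheduler may hide the latency in both cases.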