Do PTX "prefetch" instructions (CCTL in SASS) implicitly act as memory barriers?

When I issue a CCTL instruction (`prefetch` in PTX) ahead of an LDG instruction, I expect a significant performance boost, provided the CCTL can execute concurrently with the LDG. However, the improvement I measured falls short of that expectation. This would make sense if the LDG has to wait for the CCTL to complete. Does anyone know whether a CCTL instruction actually stalls subsequent loads from DRAM?
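
For concreteness, here is a minimal sketch of the pattern in question, not my actual code. It assumes a simple grid-stride sum where each thread issues a PTX `prefetch.L2` hint a few iterations ahead of the load it is about to perform, and it times the kernel with and without the hint. The names `prefetch_l2`, `sum_kernel`, `time_kernel`, and the prefetch distance `DIST` are hypothetical placeholders chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: emits a PTX prefetch hint into L2 (generic addressing).
__device__ __forceinline__ void prefetch_l2(const void* p) {
    asm volatile("prefetch.L2 [%0];" :: "l"(p));
}

// Grid-stride sum. When UsePrefetch is true, each iteration hints the
// element DIST iterations ahead before loading the current element.
template <bool UsePrefetch>
__global__ void sum_kernel(const float* in, float* out, size_t n) {
    const size_t DIST = 4;  // assumed prefetch distance, in loop iterations
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (size_t i = tid; i < n; i += stride) {
        if (UsePrefetch) {
            size_t ahead = i + DIST * stride;
            if (ahead < n) prefetch_l2(in + ahead);
        }
        acc += in[i];  // the global load that the prefetch is meant to help
    }
    out[tid] = acc;
}

// Launches one variant of the kernel and returns its elapsed time in ms.
template <bool UsePrefetch>
float time_kernel(const float* in, float* out, size_t n, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    sum_kernel<UsePrefetch><<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t n = 1 << 26;            // number of input floats (256 MiB)
    const int threads = 256, blocks = 1024;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, (size_t)blocks * threads * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    time_kernel<false>(in, out, n, blocks, threads);  // warm-up launch
    float plain = time_kernel<false>(in, out, n, blocks, threads);
    float hinted = time_kernel<true>(in, out, n, blocks, threads);
    printf("no prefetch:   %.3f ms\n", plain);
    printf("with prefetch: %.3f ms\n", hinted);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Running `cuobjdump -sass` on the compiled binary shows which SASS instructions the prefetch hint was actually lowered to on a given architecture, which is worth checking before drawing conclusions from the timings.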