When I issue a CCTL instruction (`prefetch` in PTX) before an LDG instruction, I would expect a significant performance boost, provided the CCTL can execute concurrently with the LDG. However, the improvement I measured falls short of my expectations. This could be explained if the LDG has to wait for the CCTL to complete. Does anyone know whether the CCTL instruction actually stalls subsequent DRAM load operations?
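To make the question concrete, here is a minimal sketch of the pattern I mean. It is not my actual kernel: the names `sum_kernel`, `prefetch_l2` and `run_sum`, the launch configuration, and the choice to prefetch one grid stride ahead are all hypothetical and only for illustration. The sketch uses the generic-address form of the PTX `prefetch` instruction; whether it is lowered to CCTL depends on the target architecture and the compiler.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Issue a PTX prefetch into L2 for a generic address (assumed to point to global memory).
__device__ __forceinline__ void prefetch_l2(const void* p) {
    asm volatile("prefetch.L2 [%0];" :: "l"(p));
}

// Grid-stride sum. When use_prefetch is set, each thread prefetches the element
// it will read in its next iteration before loading the current one.
__global__ void sum_kernel(const float* __restrict__ in, float* out,
                           int n, bool use_prefetch) {
    const int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (use_prefetch && i + stride < n) {
            prefetch_l2(in + i + stride);   // prefetch for the next iteration
        }
        acc += __ldg(in + i);               // read-only global load
    }
    atomicAdd(out, acc);
}

// Hypothetical host helper: runs the kernel once on an array of ones,
// returns the sum, and reports the kernel time in milliseconds.
float run_sum(int n, bool use_prefetch, float* elapsed_ms) {
    std::vector<float> host(n, 1.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    sum_kernel<<<64, 256>>>(d_in, d_out, n, use_prefetch);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsed_ms, start, stop);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `run_sum` once with `use_prefetch` set to `false` and once with `true` gives two timings to compare, and dumping the SASS with `cuobjdump -sass` should show where the prefetch and the LDG end up relative to each other.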