Hi All,
I would like to use the PTX prefetch instruction to speed up a parallel application. The basic idea is to prefetch data from global memory into the L2 cache while the kernel is busy calculating. In theory this should keep more gmem transactions in flight and reduce gmem latency stalls. It is also an excellent demonstration of the CUDA 4.0 inline PTX (asm) capability.
Three questions:
1) I wrote the following inline PTX method. It compiles and runs, but I do not see the prefetch instruction in the cuobjdump output. Any ideas (perhaps the code is being optimized out…)?
[code]
template <typename T1>
__device__ __forceinline__ void prefetchASM(const T1* data, int offset) {
    data += offset;
    // The address is an input to prefetch, so it needs an input constraint
    // ("l"), not the output constraint ("=l") used before; "volatile" keeps
    // the compiler from dropping the statement as dead code.
    asm volatile("prefetch.global.L2 [%0];" :: "l"(data));
    //asm volatile("prefetch.global.L2 [%0];" :: "l"(data + offset));
}
[/code]
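For context, here is a minimal sketch of how I intend to use the helper: issue a prefetch for the next element a thread will touch while computing on the current one. The kernel name, access pattern, and scale operation are placeholders of my own, not part of the real application.

[code]
// Hypothetical kernel: hint the next grid-stride element into L2
// while the current element is being processed.
template <typename T>
__global__ void scaleKernel(const T* in, T* out, int n, T factor) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (i + stride < n)
            prefetchASM(in, i + stride);  // prefetch next iteration's data
        out[i] = in[i] * factor;          // compute overlaps with the prefetch
    }
}
[/code]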
2) Will prefetching past the end of an array generate a cuda-memcheck error? I'm weighing the cost of checking for the end of the memory region against paying a slight cost by prefetching past the end of the allocated region.
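The bounds-checked variant of that trade-off would look something like this sketch (the element count parameter n is my own addition):

[code]
// Guarded prefetch: skip the hint rather than read past the allocation.
// The branch is the per-call cost being weighed against an out-of-bounds
// prefetch in the unguarded version.
template <typename T1>
__device__ __forceinline__ void prefetchGuarded(const T1* data, int offset, int n) {
    if (offset < n)
        asm volatile("prefetch.global.L2 [%0];" :: "l"(data + offset));
}
[/code]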
3) I'm also wondering whether using the offset inline, as shown in the commented-out line, keeps this code from consuming an extra register.
Thanks all!