How to use PTX prefetch.global with inline ASM? Compiles, but I don't see the prefetch instruction with cuobjdump

Hi All,

I would like to use the PTX prefetch instruction to speed up a parallel application. The basic idea is to prefetch data into the L2 cache from global memory while the kernel is busy calculating. In theory this should keep more gmem transactions in flight and reduce gmem latency stalls. It is also an excellent demonstration of the CUDA 4.0 inline ASM capability.

Three questions:

  1. I wrote the following inline PTX method. It compiles and runs, but I don't see the prefetch instruction using cuobjdump. Any ideas (perhaps the code is being optimized out …)?

template <typename T1>
__device__ __forceinline__ void prefetchASM(const T1* data, int offset) {
    data += offset;
    asm("prefetch.global.L2 [%0];" : "=l"(data));
    //asm("prefetch.global.L2 [%0];" : "=l"(data + offset));
}

  2. Will prefetching past the end of an array generate a cuda-memcheck error? I'm thinking about the cost of checking for the end of the memory region versus paying a slight cost by prefetching past the end of the allocated region.

  3. I'm also wondering whether using the offset as shown in the commented-out line keeps this code from using an extra register.

Thanks all!

You can't see the prefetch instruction because it only exists on compute capability 2.x devices, and cuobjdump only works for compute capability 1.x. Use nvc0dis to disassemble 2.x cubins.

The new version of cuobjdump in the CUDA 4.0 Toolkit can disassemble the Fermi instruction set (compute capability 2.x).
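
For example, compiling for sm_20 and then dumping the SASS should look something like this (prefetch.cu is just a placeholder file name):

nvcc -arch=sm_20 -cubin prefetch.cu -o prefetch.cubin
cuobjdump -sass prefetch.cubin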

NVIDIA answered my question. Note that the prefetch shows up as a CCTL instruction in cuobjdump! Their response helped me, so I'm posting it in the hope it will help others. Many thanks to NVIDIA for the help!

cuobjdump.exe can dump SASS instructions. For the kernels below, it generates the CCTL instruction for cache control when the prefetch command is included as shown (as mentioned in the cuobjdump.pdf in the doc).

__global__ void prefetchASMREG( float *data, int offset ) {
    data += offset;
    asm("prefetch.global.L2 [%0];"::"r"(data) );
    data[0] = 1.0f;
}

__global__ void prefetchASMREG2( float *data, int offset ) {
    asm("prefetch.global.L2 [%0];"::"r"(data+offset) );
    data[0] = 1.0f;
}

Gives something like:

            Function : _Z14prefetchASMREGPfi
/*0000*/    /*0x00005de428004404*/    MOV R1, c [0x1] [0x100];
/*0008*/    /*0x90001de428004000*/    MOV R0, c [0x0] [0x24];
/*0010*/    /*0x00009de218fe0000*/    MOV32I R2, 0x3f800000;
/*0018*/    /*0x80001c4340004000*/    ISCADD R0, R0, c [0x0] [0x20], 0x2;
/*0020*/    /*0x00001c6598000000*/    CCTL.PF2 R0, [R0];
/*0028*/    /*0x00009c8590000000*/    ST [R0], R2;
/*0030*/    /*0x00001de780000000*/    EXIT;
            ......................................

For the register usage, compiling with --ptxas-options="-v" seems to show that the "data += offset" version uses 1 less register, and it looks like an extra register is used if "r"(data+offset) is given instead (R2 above).

The individual believes prefetch only takes a source operand, prefetch{.space}.level [a];, where [a] is a source operand (per ptx_isa_2.3.pdf), so the kernels they tried tweaked the asm call a bit (above).

Prefetch didn't get any memory violations using cuda-memcheck with out-of-bounds addresses; even passing NULL seemed to work. It's possible to get 'illegal instruction', however; for example, prefetching 0xFFFFFFFF gives that result.

The individual was not aware of any guarantees as to behaviour when invalid addresses are passed. It's probably advisable to test in general.
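
Putting the pieces together, here is a corrected version of my original template. This is just a sketch based on the response above: the address goes in as an input operand (no "=" in the constraint), "l" is used assuming a 64-bit build, and volatile keeps the compiler from dropping the asm.

template <typename T1>
__device__ __forceinline__ void prefetchASM(const T1* data, int offset) {
    // Pass the address as a 64-bit *input* operand ("l"), not an output ("=l").
    asm volatile ("prefetch.global.L2 [%0];" :: "l"(data + offset));
}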

Prefetching does help, as can be seen in the attached figure comparing global memory bandwidth with and without prefetching for a reduction kernel that computes the sum of a vector. Basically, the kernel reads through the vector once while computing the sum.

Attached is a code snippet. More details are in my upcoming book "CUDA Application Design and Development". Note the use of a 64-bit address with the "l" operand constraint.

const T1 *pt = data + i + N_BLOCKS*N_THREADS/sizeof(T1);
asm volatile ("prefetch.global.L2 [%0];"::"l"(pt) );
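
For context, here is a sketch of how that prefetch sits inside the reduction loop. N_BLOCKS and N_THREADS stand for the launch configuration, and the grid-stride loop and omitted block reduction are my additions for illustration, not the exact book code:

#define N_BLOCKS  64   // assumed launch configuration
#define N_THREADS 256

template <typename T1>
__global__ void sumKernel(const T1 *data, int n, T1 *partialSums) {
    T1 sum = 0;
    // Grid-stride loop: prefetch an element ahead of the current position
    // into L2 while summing the current one.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += N_BLOCKS * N_THREADS) {
        const T1 *pt = data + i + N_BLOCKS*N_THREADS/sizeof(T1);
        asm volatile ("prefetch.global.L2 [%0];" :: "l"(pt));
        sum += data[i];
    }
    // ... block-level reduction of 'sum' into partialSums[blockIdx.x] omitted ...
}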

@manyThreads - thanks for the useful input. I am interested to know more about the following:

– Can the compiler also insert prefetch instructions? Is there a flag to enable this?

– What should we be looking into (or be careful about) while injecting prefetch instructions into our code?

I have a register-spill-heavy kernel - it uses many constant-addressed 4x4 arrays of doubles, and I guess most of them spill out.

I achieve a 65-70% L1 hit rate. Maybe prefetching can benefit me if used judiciously?

I use the 48 KB L1 cache / 16 KB shared memory config, and it's a Tesla 2070.

Also note that ptxas has some interesting global load/store options:

--def-load-cache                              (-dlcm)

        Default cache modifier on global/generic load.

        Default value:  'ca'.

--def-store-cache                             (-dscm)

        Default cache modifier on global/generic store.

I’ve never tried them out though.
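
If anyone wants to experiment, ptxas options can be reached from nvcc with -Xptxas, e.g. (untested on my side, kernel.cu is a placeholder):

nvcc -arch=sm_20 -Xptxas -dlcm=cg kernel.cu

If I read the docs right, -dlcm=cg would make global loads default to .cg, i.e. cache in L2 only, bypassing L1.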