How to use PTX prefetch.global with inline ASM? Compiles, but I don't see the prefetch instruction with cuobjdump

Hi All,

I would like to use the PTX prefetch instruction to speed up a parallel application. The basic idea is to prefetch data into the L2 cache from global memory while the kernel is busy calculating. In theory this should keep more gmem transactions in flight and reduce gmem latency stalls. It is also an excellent demonstration of the CUDA 4.0 inline ASM capability.

Three questions:

  1. I wrote the following inline PTX method. It compiles and runs, but I don't see the prefetch instruction using cuobjdump. Any ideas (perhaps the code is being optimized out …)?

template <typename T1>
__device__ __forceinline__ void prefetchASM(const T1* data, int offset) {
    data += offset;
    asm("prefetch.global.L2 [%0];" : "=l"(data));
    //asm("prefetch.global.L2 [%0];" : "=l"(data + offset));
}

  2. Will prefetching past the end of an array generate a cuda-memcheck error? I'm thinking about the cost of checking for the end of the memory region versus paying a slight cost by prefetching past the end of the allocated region.

  3. I'm also wondering whether using the offset as shown in the commented-out line keeps this code from using an extra register.

Thanks all!

You can't see the prefetch instruction because it only exists on compute capability 2.x devices, and cuobjdump only works for compute capability 1.x. Use nvc0dis to disassemble 2.x cubins.

The new version of cuobjdump in the CUDA 4.0 Toolkit can disassemble the Fermi instruction set (compute capability 2.x).
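
For example, compiling for sm_20 and then dumping the SASS should look something like this (prefetch.cu is just a placeholder file name):

nvcc -arch=sm_20 -cubin prefetch.cu -o prefetch.cubin
cuobjdump -sass prefetch.cubin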

NVIDIA answered my question. Note that the prefetch shows up as a CCTL instruction in cuobjdump! Their response helped me, so I'm posting it in the hope it will help others. Many thanks to NVIDIA for the help!

cuobjdump.exe can dump SASS instructions. For the kernels below, it generates the CCTL instruction for cache control when the prefetch command is included as shown (as mentioned in the cuobjdump.pdf in the doc).

__global__ void prefetchASMREG( float *data, int offset ) {
    data += offset;
    asm("prefetch.global.L2 [%0];"::"r"(data) );
    data[0] = 1.0f;
}

__global__ void prefetchASMREG2( float *data, int offset ) {
    asm("prefetch.global.L2 [%0];"::"r"(data+offset) );
    data[0] = 1.0f;
}

Gives something like:

            Function : _Z14prefetchASMREGPfi
/*0000*/    /*0x00005de428004404*/    MOV R1, c [0x1] [0x100];
/*0008*/    /*0x90001de428004000*/    MOV R0, c [0x0] [0x24];
/*0010*/    /*0x00009de218fe0000*/    MOV32I R2, 0x3f800000;
/*0018*/    /*0x80001c4340004000*/    ISCADD R0, R0, c [0x0] [0x20], 0x2;
/*0020*/    /*0x00001c6598000000*/    CCTL.PF2 R0, [R0];
/*0028*/    /*0x00009c8590000000*/    ST [R0], R2;
/*0030*/    /*0x00001de780000000*/    EXIT;
            ......................................

For the register usage, compiling with --ptxas-options="-v" seems to show that the "data += offset" version uses 1 less register, and it looks like an extra register is used if "r"(data+offset) is given instead (R2 above).

The individual believes prefetch only takes a source operand, prefetch{.space}.level [a];, where [a] is a source operand (per ptx_isa_2.3.pdf), so the kernels they tried tweaked the asm call a bit (above).

Prefetch didn't get any memory violations using cuda-memcheck with out-of-bounds addresses; even passing NULL seemed to work. It's possible to get 'illegal instruction', however; for example, prefetching 0xFFFFFFFF gives that result.

The individual was not aware of any guarantees as to behaviour when invalid addresses are passed. It's probably advisable to test in general.
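
Putting the pieces together, here is a corrected version of my original template. This is just a sketch based on the response above: the address goes in as an input operand (no "=" in the constraint), "l" is used assuming a 64-bit build, and volatile keeps the compiler from dropping the asm.

template <typename T1>
__device__ __forceinline__ void prefetchASM(const T1* data, int offset) {
    // Pass the address as a 64-bit *input* operand ("l"), not an output ("=l").
    asm volatile ("prefetch.global.L2 [%0];" :: "l"(data + offset));
}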

Prefetching does help, as can be seen in the attached figure comparing global memory bandwidth with and without prefetching for a reduction kernel that computes the sum of a vector. Basically, the kernel reads through the vector once while computing the sum.

Attached is a code snippet. More details are in my upcoming book "CUDA Application Design and Development". Note the use of a 64-bit address with the "l" operand constraint.

const T1 *pt = data + i + N_BLOCKS*N_THREADS/sizeof(T1);
asm volatile ("prefetch.global.L2 [%0];"::"l"(pt) );
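
For context, here is a sketch of how that prefetch sits inside the reduction loop. N_BLOCKS and N_THREADS stand for the launch configuration, and the grid-stride loop and omitted block reduction are my additions for illustration, not the exact book code:

#define N_BLOCKS  64   // assumed launch configuration
#define N_THREADS 256

template <typename T1>
__global__ void sumKernel(const T1 *data, int n, T1 *partialSums) {
    T1 sum = 0;
    // Grid-stride loop: prefetch an element ahead of the current position
    // into L2 while summing the current one.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += N_BLOCKS * N_THREADS) {
        const T1 *pt = data + i + N_BLOCKS*N_THREADS/sizeof(T1);
        asm volatile ("prefetch.global.L2 [%0];" :: "l"(pt));
        sum += data[i];
    }
    // ... block-level reduction of 'sum' into partialSums[blockIdx.x] omitted ...
}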

@manyThreads - thanks for the useful input. I am interested to know more about the following:

– Can the compiler also insert prefetch instructions? Is there a flag to enable this?

– What should we be looking into (or be careful about) while injecting prefetch instructions into our code?

I have a register-spill-heavy kernel - it uses many constant-addressed 4x4 arrays of doubles, and I guess most of them spill out.

I achieve a 65-70% L1 hit rate. Maybe prefetching can benefit me if used judiciously?

I use the 48 KB L1 cache / 16 KB shared memory config, and it's a Tesla 2070.

Also note that ptxas has some interesting global load/store options:

--def-load-cache                              (-dlcm)

        Default cache modifier on global/generic load.

        Default value:  'ca'.

--def-store-cache                             (-dscm)

        Default cache modifier on global/generic store.

I’ve never tried them out though.
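
If anyone wants to experiment, ptxas options can be reached from nvcc with -Xptxas, e.g. (untested on my side, kernel.cu is a placeholder):

nvcc -arch=sm_20 -Xptxas -dlcm=cg kernel.cu

If I read the docs right, -dlcm=cg would make global loads default to .cg, i.e. cache in L2 only, bypassing L1.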