It appears to be cached in L1(*) but not in L2, so my previous comment is not correct.
Analyzing PTX isn't very reliable here: ptxas, the tool that converts PTX to the machine code that actually executes, is an optimizing compiler.
I think a simpler test case can be used:
$ cat t403.cu
#include <stdio.h>
#include <string.h>  // for memset
__global__ void k(int *d){
  int a = d[threadIdx.x];
#ifdef USE_ADD
  a += d[blockDim.x-threadIdx.x-1];
#endif
  if (a > 0) d[threadIdx.x] = a;
}
const int ds = 128;
int main(){
  int *d;
  cudaHostAlloc(&d, ds*sizeof(d[0]), cudaHostAllocDefault);
  memset(d, 0, ds*sizeof(d[0]));
  k<<<128,ds>>>(d);
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_60 -o t403 t403.cu
$ cuobjdump -sass ./t403
Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_60
Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_60
Function : _Z1kPi
.headerflags @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
/* 0x083fc400e3e007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ; /* 0x4c98078000870001 */
/*0010*/ S2R R2, SR_TID.X ; /* 0xf0c8000002170002 */
/*0018*/ SHR.U32 R0, R2.reuse, 0x1e ; /* 0x3828000001e70200 */
/* 0x001ed400fc4007e6 */
/*0028*/ ISCADD R2.CC, R2, c[0x0][0x140], 0x2 ; /* 0x4c18810005070202 */
/*0030*/ IADD.X R3, R0, c[0x0][0x144] ; /* 0x4c10080005170003 */
/*0038*/ LDG.E R0, [R2] ; /* 0xeed4200000070200 */
/* 0x001ff400fd4107ed */
/*0048*/ ISETP.GE.AND P0, PT, R0, 0x1, PT ; /* 0x366d038000170007 */
/*0050*/ NOP ; /* 0x50b0000000070f00 */
/*0058*/ @!P0 EXIT ; /* 0xe30000000008000f */
/* 0x001fbc00fde007f1 */
/*0068*/ STG.E [R2], R0 ; /* 0xeedc200000070200 */
/*0070*/ NOP ; /* 0x50b0000000070f00 */
/*0078*/ NOP ; /* 0x50b0000000070f00 */
/* 0x001ffc00fc6007ef */
/*0088*/ NOP ; /* 0x50b0000000070f00 */
/*0090*/ NOP ; /* 0x50b0000000070f00 */
/*0098*/ EXIT ; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00a8*/ BRA 0xa0 ; /* 0xe2400fffff07000f */
/*00b0*/ NOP; /* 0x50b0000000070f00 */
/*00b8*/ NOP; /* 0x50b0000000070f00 */
.................
Fatbin ptx code:
================
arch = sm_60
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvprof --metrics sysmem_read_bytes ./t403
==13132== NVPROF is profiling process 13132, command: ./t403
==13132== Profiling application: ./t403
==13132== Profiling result:
==13132== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla P100-PCIE-16GB (0)"
Kernel: k(int*)
1 sysmem_read_bytes System Memory Read Bytes 28672 28672 28672
$ nvcc -arch=sm_60 -o t403 t403.cu -DUSE_ADD
$ cuobjdump -sass ./t403
Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_60
Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_60
Function : _Z1kPi
.headerflags @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
/* 0x003fc400e3e007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ; /* 0x4c98078000870001 */
/*0010*/ S2R R2, SR_TID.X ; /* 0xf0c8000002170002 */
/*0018*/ LOP.PASS_B R0, RZ, ~R2 ; /* 0x5c4707000027ff00 */
/* 0x001f8800fc8207f1 */
/*0028*/ SHR.U32 R3, R2.reuse, 0x1e ; /* 0x3828000001e70203 */
/*0030*/ ISCADD R2.CC, R2, c[0x0][0x140], 0x2 ; /* 0x4c18810005070202 */
/*0038*/ IADD R0, R0, c[0x0][0x8] ; /* 0x4c10000000270000 */
/* 0x001fc440fe0007f4 */
/*0048*/ IADD.X R3, R3, c[0x0][0x144] ; /* 0x4c10080005170303 */
/*0050*/ { SHR.U32 R5, R0.reuse, 0x1e ; /* 0x3828000001e70005 */
/*0058*/ LDG.E R6, [R2] }
/* 0xeed4200000070206 */
/* 0x001ed400fc4007e6 */
/*0068*/ ISCADD R4.CC, R0, c[0x0][0x140], 0x2 ; /* 0x4c18810005070004 */
/*0070*/ IADD.X R5, R5, c[0x0][0x144] ; /* 0x4c10080005170505 */
/*0078*/ LDG.E R4, [R4] ; /* 0xeed4200000070404 */
/* 0x001ff400fda107f6 */
/*0088*/ IADD R0, R4, R6 ; /* 0x5c10000000670400 */
/*0090*/ ISETP.GE.AND P0, PT, R0, 0x1, PT ; /* 0x366d038000170007 */
/*0098*/ @!P0 EXIT ; /* 0xe30000000008000f */
/* 0x001fbc00fde007f1 */
/*00a8*/ STG.E [R2], R0 ; /* 0xeedc200000070200 */
/*00b0*/ NOP ; /* 0x50b0000000070f00 */
/*00b8*/ NOP ; /* 0x50b0000000070f00 */
/* 0x001ffc00fc6007ef */
/*00c8*/ NOP ; /* 0x50b0000000070f00 */
/*00d0*/ NOP ; /* 0x50b0000000070f00 */
/*00d8*/ EXIT ; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00e8*/ BRA 0xe0 ; /* 0xe2400fffff07000f */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
.................
Fatbin ptx code:
================
arch = sm_60
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvprof --metrics sysmem_read_bytes ./t403
==13189== NVPROF is profiling process 13189, command: ./t403
==13189== Profiling application: ./t403
==13189== Profiling result:
==13189== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla P100-PCIE-16GB (0)"
Kernel: k(int*)
1 sysmem_read_bytes System Memory Read Bytes 28672 28672 28672
$
In the above example, we see that the total number of sysmem read bytes is 56*512 = 28672: 56 SMs in my P100, times 512 bytes (128*sizeof(int)). Therefore it's not cached in L2; if it were, the bytes read from system memory would not scale with the number of SMs. We also see that the total number of bytes is unchanged when I do 2 reads in the kernel (-DUSE_ADD), so it is being cached in L1. In the first case, the SASS shows one LDG instruction, whereas in the second case there are 2.
(*) Note that the behavior could differ based on the GPU. Kepler devices make more limited use of the L1 cache; sysmem transactions don't appear to be cached by default on cc3.5 devices. On those devices, a similar metric test (sysmem_read_transactions) shows a doubling of the transaction count when there are 2 LDG instructions in the kernel.
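A sketch of that cc3.5 comparison (commands only, assuming a Kepler GPU is present; the nvprof output itself is omitted here):

```shell
# build and profile the 1-load variant, then the 2-load variant,
# comparing sysmem_read_transactions between the two runs
nvcc -arch=sm_35 -o t403 t403.cu
nvprof --metrics sysmem_read_transactions ./t403
nvcc -arch=sm_35 -o t403 t403.cu -DUSE_ADD
nvprof --metrics sysmem_read_transactions ./t403
```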