When I compile the vectorAdd sample code with your -Xptxas -dlcm=cg -Xptxas -dscm=cg switches for cc 5.2 on Godbolt and look at the SASS output:
_Z9vectorAddPKfS0_Pfi:
MOV R1, c[0x0][0x20]
S2R R0, SR_CTAID.X
S2R R2, SR_TID.X
XMAD.MRG R3, R0.reuse, c[0x0][0x8].H1, RZ
XMAD R2, R0.reuse, c[0x0][0x8], R2
XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2
ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT
NOP
@P0 EXIT
SHL R6, R0.reuse, 0x2
SHR R0, R0, 0x1e
IADD R4.CC, R6.reuse, c[0x0][0x140]
IADD.X R5, R0.reuse, c[0x0][0x144]
IADD R2.CC, R6, c[0x0][0x148]
LDG.E.CG R4, [R4] // global load has .CG
IADD.X R3, R0, c[0x0][0x14c]
LDG.E.CG R2, [R2] // global load has .CG
IADD R6.CC, R6, c[0x0][0x150]
IADD.X R7, R0, c[0x0][0x154]
FADD R0, R2, R4
FADD R0, RZ, R0
STG.E.CG [R6], R0 // global store has .CG
NOP
EXIT
I see that the only global loads and the only global store are decorated with .CG.
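For reference, here is a minimal sketch of what I compiled. The kernel is the one from the CUDA vectorAdd sample as I remember it, so treat the exact body as an assumption; its signature does match the mangled name _Z9vectorAddPKfS0_Pfi in the listing above. runVectorAdd is a hypothetical host helper I added only so the file can be run and checked; it is not part of the sample. The command lines in the comment are one way to reproduce the SASS locally instead of on Godbolt, and the file names there are placeholders.

```cuda
// One way to reproduce the SASS locally (file names are placeholders):
//   nvcc -arch=sm_52 -cubin -Xptxas -dlcm=cg -Xptxas -dscm=cg -o vectorAdd.cubin vectorAdd.cu
//   cuobjdump -sass vectorAdd.cubin
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Kernel as in the CUDA vectorAdd sample (reproduced from memory; an assumption).
// The "+ 0.0f" appears to correspond to the "FADD R0, RZ, R0" line in the SASS.
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i] + 0.0f;   // two global loads, one global store
    }
}

// Hypothetical helper (not in the sample): runs the kernel on n elements and
// returns the maximum absolute error against a host reference, or -1.0f if
// any CUDA call fails (for example, when no GPU is present).
float runVectorAdd(int n)
{
    std::vector<float> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) {
        hA[i] = 0.5f * i;
        hB[i] = 2.0f * i;
    }

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer is a no-op, so this is safe on early failure.
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) return -1.0f;

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) {
        maxErr = fmaxf(maxErr, fabsf(hC[i] - (hA[i] + hB[i])));
    }
    return maxErr;
}
```

For SASS inspection only the kernel matters; the helper just needs to be called from a main() if you want to run it. The cache switches change which decorators appear on LDG/STG and should not change the numerical result.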
When I compile for cc 7.2, I see the “STRONG” decorator instead:
LDG.E.STRONG.GPU R4, [R4]
LDG.E.STRONG.GPU R3, [R2]
IMAD.WIDE R6, R6, R7, c[0x0][0x170]
FADD R0, R4, R3
FADD R9, RZ, R0
STG.E.STRONG.GPU [R6], R9
The decorator is different here, but my understanding is that STRONG may imply volatile, and volatile generally implies bypassing the L1, so these accesses likely still bypass the L1. If you have further questions about this, you may want to ask on a Jetson forum.
If you have questions about the Nsight Compute tool, I suggest asking those on the relevant forum.