-Xptxas default cache modifier on global/generic load and store not working

Hello,

I am trying to compile the vectorAdd sample application on the Jetson Xavier using nvcc version 11.4. I am adding -Xptxas -dlcm=cg and -Xptxas -dscm=cg; however, when viewing the PTX code, the ld and st instructions are not changing to ld.cg and st.cg. Do you have any idea why? I also tried the flcm and fscm options and have the same issue.

Thank you.

The -Xptxas switch doesn’t affect the generation of PTX. It affects the conversion of PTX to SASS. So it is expected that -Xptxas will have no impact on the generated PTX. Refer to the nvcc manual for further description of the compilation flow.
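You can confirm this by dumping both representations from the same binary; it is the SASS, not the PTX, that should show the modifier. A minimal check along these lines (sample include paths omitted; cuobjdump ships with the toolkit):

nvcc -arch=sm_72 -Xptxas -dlcm=cg -Xptxas -dscm=cg vectorAdd.cu -o vectorAdd
cuobjdump -sass ./vectorAdd    # SASS: this is where -dlcm/-dscm take effect
cuobjdump -ptx  ./vectorAdd    # PTX: unaffected by any -Xptxas option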


Ah, OK, my problem is elsewhere then, since the -Xptxas flags are not preventing access to the L1 cache. When I use Nsight Compute, I always obtain a nonzero cache-hit percentage, so the SM is still accessing the L1 cache.

dlcm affects global loads. The L1 can also hit on local activity.
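As an illustration of local activity (a sketch of my own, not the sample code): dynamically indexing a per-thread array typically forces it into local memory, and to my understanding the resulting LDL/STL traffic is serviced by L1 even when -dlcm=cg/-dscm=cg are set, since those switches only cover global accesses:

__global__ void localTraffic(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int buf[32];                   // dynamic indexing below forces buf into local memory
    for (int k = 0; k < 32; ++k)
        buf[k] = in[(i + k) % n];  // global loads: covered by -dlcm
    out[i] = buf[in[i] & 31];      // local load (LDL): still L1 traffic
}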


Hello again,

Well I tried again, and here are the results:

First, here is the command line used to build the executable (I used NVIDIA’s makefile and just added the flcm and fscm flags to it):

Then I did the profiling with ncu and obtained:

It seems the global load/store instructions are still going through the L1 cache.

Is the Jetson Xavier an sm_70 device? Why are you compiling for compute_70,sm_70?


I think it is 72, you are right. I tested again with 72 and obtained the same results. I noticed something: the global load has a 0% cache hit rate but the store has 50%. That’s why I obtained 16%.

I tried again without the dlcm and dscm flags, and the cache hit rate on the global load went to 50% while the store stayed almost the same (49.71%). I then recompiled with only the dscm flag, and it still wasn’t bypassing L1 for the global stores.
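To read the two hit rates separately from the command line, the per-operation L1 metrics can be queried directly (metric names are my assumption from recent ncu versions; ncu --query-metrics lists what your version exposes):

ncu --metrics l1tex__t_sector_pipe_lsu_mem_global_op_ld_hit_rate.pct,l1tex__t_sector_pipe_lsu_mem_global_op_st_hit_rate.pct ./vectorAdd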

Hello again,

While checking the Nsight Compute release notes, I started to think the ncu version on the Jetson (2022.2.1.0) may have a bug. If I understood correctly, the 2023 version of Nsight Compute is not available for the Jetson, right?
Thanks again for your support.

When I compile the vectorAdd sample code with your -Xptxas -dlcm=cg -Xptxas -dscm=cg switches for cc 5.2 on godbolt and look at the SASS output:

_Z9vectorAddPKfS0_Pfi:
 MOV R1, c[0x0][0x20] 
 S2R R0, SR_CTAID.X 
 S2R R2, SR_TID.X 
 XMAD.MRG R3, R0.reuse, c[0x0][0x8].H1, RZ 
 XMAD R2, R0.reuse, c[0x0][0x8], R2 
 XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 
 ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT 
 NOP 
 @P0 EXIT 
 SHL R6, R0.reuse, 0x2 
 SHR R0, R0, 0x1e 
 IADD R4.CC, R6.reuse, c[0x0][0x140] 
 IADD.X R5, R0.reuse, c[0x0][0x144] 
 IADD R2.CC, R6, c[0x0][0x148] 
 LDG.E.CG R4, [R4]         // global load has .CG
 IADD.X R3, R0, c[0x0][0x14c] 
 LDG.E.CG R2, [R2]        // global load has .CG
 IADD R6.CC, R6, c[0x0][0x150] 
 IADD.X R7, R0, c[0x0][0x154] 
 FADD R0, R2, R4 
 FADD R0, RZ, R0 
 STG.E.CG [R6], R0         // global store has .CG
 NOP 
 EXIT

I see that the only global loads and the only global store are decorated with .CG.

When I compile for cc 7.2, I see the “STRONG” decorator instead:

 LDG.E.STRONG.GPU R4, [R4] 
 LDG.E.STRONG.GPU R3, [R2] 
 IMAD.WIDE R6, R6, R7, c[0x0][0x170] 
 FADD R0, R4, R3 
 FADD R9, RZ, R0 
 STG.E.STRONG.GPU [R6], R9

but my understanding is that STRONG may imply volatile semantics, and volatile generally implies bypassing the L1. If you have further questions about this, you may want to ask on a Jetson forum.
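If the goal is simply to bypass L1 on these particular accesses, an alternative to changing the ptxas-wide default is to request the cache operator per access in source with the __ldcg/__stcg cache-hint intrinsics (a minimal sketch, not the sample’s actual code):

__global__ void vectorAddCG(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // __ldcg/__stcg emit ld.global.cg / st.global.cg in PTX:
        // cache at L2 and below, bypassing L1
        __stcg(&c[i], __ldcg(&a[i]) + __ldcg(&b[i]));
}

Since the cache operator is then present in the PTX itself, it does not depend on ptxas honoring -dlcm/-dscm.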

If you have questions about the nsight compute tool, I suggest asking those on the relevant forum.

Hi,

I’m having the same issue, but on an RTX 3060 Ti. Using flcm/fscm doesn’t change the loads/stores either. Compiling for CC 5.2, 6.1, 7.0, 7.5, or 8.6 does not add .CG for me. Compiling for 7.0 and 7.5 adds a .SYS decorator to the LDG and STG instructions, but not .CG.

Driver version: 555.42.02
CUDA version: 12.5

The full command I’m using to compile the vector addition sample is shown below:

nvcc -gencode arch=compute_86,code=sm_86 -Xptxas -dlcm=cg -Xptxas -dscm=cg -I../../common/inc vectorAdd.cu -o vectorAdd

SASS from Nsight Compute 2024:

      MOV R1, c[0x0][0x28]
      S2R R6, SR_CTAID.X
      S2R R3, SR_TID.X
      IMAD R6, R6, c[0x0][0x0], R3
      ISETP.GE.AND P0, PT, R6, c[0x0][0x178], PT
@P0   EXIT
      MOV R7, 0x4
      ULDC.64 UR4, c[0x0][0x118]
      IMAD.WIDE R4, R6, R7, c[0x0][0x168]
      IMAD.WIDE R2, R6, R7, c[0x0][0x160]
      LDG.E R4, [R4.64]                                    // no .CG
      LDG.E R3, [R2.64]                                    // no .CG
      IMAD.WIDE R6, R6, R7, c[0x0][0x170]
      FADD R9, R4, R3
      STG.E [R6.64], R9                                     // no .CG
      EXIT
      BRA 0x7f2bf525f200

According to my testing on godbolt, there was a change in compiler behavior sometime between CUDA 12.3.1 and CUDA 12.4.1.
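One way to narrow down whether ptxas itself is ignoring the switch is to run it by hand on the generated PTX, outside the nvcc driver (a sketch, assuming the same vectorAdd.cu source):

nvcc -I../../common/inc -arch=compute_86 -ptx vectorAdd.cu -o vectorAdd.ptx
ptxas -arch=sm_86 -dlcm=cg -dscm=cg vectorAdd.ptx -o vectorAdd.cubin
cuobjdump -sass vectorAdd.cubin | grep -E 'LDG|STG'

If .CG is missing even in that path, the change would appear to be in ptxas itself rather than in how nvcc invokes it.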

If this is of concern to you, you may wish to file a bug.
