Vpr-carveout causing kernel failure related to mc-err

Recently we upgraded our AGX Xavier(64GB version) to L4T 35.3.1 (JetPack 5.1.1) on CTI rogue, and we noticed that a simple code of allocate 55GB of vram via cudaMalloc, then try to write 4k to each of the 1MB block(just as a quick way to see where we see error), it will 100% time trigger an error in the application(wrong argument), with kernel logs like this:

1 Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: (255) csw_nvl4w: EMEM address decode error 2 Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: status = 0x200100bb; addr = 0x68570400; hi_adr_reg=0x0 3 Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: secure: no, access-type: write

This is accompanied by a GPU fault, confirming the illegal write is from the GPU.

1 nvgpu: 17000000.gv11b gv11b_fb_mmu_fault_info_dump:294 [ERR] [MMU FAULT] ... fault addr: 0x739ecd000, ... access type: virt write

During out investigation, we noticed that even in the dtb we have vpr-carveout disabled, if we dump the running system’s device tree, we still see:



   1     vpr-carveout {
   2             compatible = "nvidia,vpr-carveout";
   3             status = "okay";
   4             reg = <0x00 0xce000000 0x00 0x2a000000>;
   5             phandle = <0x9b>;
   6     };


I wonder if this is related to this issue. And if it is, what is the right way to disable it? Since the original dtb under /boot/dtb/ (referenced by /boot/extlinux/extlinux.conf) have it disabled

TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      FDT /boot/dtb/kernel_tegra194-agx-cti-AGX101-JCB005-AVT-CSI2-4CAM.dtb
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 rootfstype=ext4 sdhci_tegra.en_boot_part_access=1 video=efifb:off
                vpr-carveout {
                        compatible = "nvidia,vpr-carveout";
                        status = "disabled";
                        phandle = <0x2a6>;
                };

just to give more context:

  1. when we run the test(allocate a single 55GB vram then write), there was enough free mem, and , the allocation was successful. The failure always come after some writing.
  2. if we allocate smaller vram, it will still happen but not as reliably: even only 1GB can trigger it sometimes.
  3. the reason we are doing this test is because we noticed a non-deterministic error happening to our live system. this is just a way to reproduce it.
  4. our theory is that because of this carveout, some page of the allocation we get belongs to some memory area that is not actually accessible by the GPU.
  5. that carveout is not present in R32 before we upgrade and the problem is not triggered there either.

Attached the code to reproduce

int memtest(int sz_g) {
    uint8_t *data = new uint8_t[4 * 1024];
    for (int i = 0; i < 4 * 1024; i++) {
        data[i] = i % 255;
    }
    
    uint8_t *d_data;
    size_t sz = sz_g * 1024;
    cudaError_t err = cudaMalloc((void**)&d_data, sz * 1024 * 1024);
    if (err != cudaSuccess) {
        std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
        return -1;
    }
    for (size_t i = 0; i < sz; i++) {
        err = cudaMemcpy(d_data + i * 1024 * 1024, data, 4 * 1024, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) {
            std::cerr << "cudaMemcpy failed at iteration " << i << ": " << cudaGetErrorString(err) << std::endl;
            // cudaFree(d_data);
            return -1;
        }
    
    }
    printf("cudaMemcpy success for %d GB\n", sz_g);
    return 0;
}

Hello @wsmlby,

At some point we were getting a VRP error, similar to the one you are experiencing, when using the AV1 HW encoder. Although I don’t believe is the same one:

[ +0.009097] tegra-mc 2c00000.memory-controller: nvencswr: secure write
@0x00000003ffffff00: Route Sanity error ((null))
[ +0.019064] tegra-mc 2c00000.memory-controller: unknown: secure read
@0x000000ffffffff00: EMEM address decode error (EMEM decode error)
[ +0.001501] tegra-mc 2c00000.memory-controller: nvencswr: secure write
@0x00000003ffffff00: VPR violation ((null))

For what is worth, it ended up being caused by using an image resolution that was not 64 aligned and it was causing memory management issues.

Also, doing a bit of searching, I found this:

I was thinking that it might be worth trying something similar for your AGX Xavier ?

Please keep us posted on test results, we might come up with some more test ideas down the line.

best regards,
Andrew
Embedded Software Engineer at ProventusNova

not sure how can we disable it. The current dts is already disabled in the dtb file:

                vpr-carveout {
                        compatible = "nvidia,vpr-carveout";
                        status = "disabled";
                        phandle = <0x9b>;
                };

but it got enabled by Nvidia’s UEFI

You can delete the vpr-carveout{} to disable it.

Thanks

Thanks. That successfully disabled the carveout but I am still seeing the failure. Can you help? the reproduction code is really simple(provided above).