Vpr-carveout causing kernel failure related to mc-err

wsmlby · October 27, 2025, 2:14pm

We recently upgraded our AGX Xavier (64GB version) on a CTI Rogue to L4T 35.3.1 (JetPack 5.1.1). Since then, a simple test that allocates 55GB of VRAM via cudaMalloc and then writes 4KB into each 1MB block (a quick way to see where errors appear) fails 100% of the time. The application gets an error (wrong argument), and the kernel logs look like this:

Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: (255) csw_nvl4w: EMEM address decode error
Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: status = 0x200100bb; addr = 0x68570400; hi_adr_reg=0x0
Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: secure: no, access-type: write

This is accompanied by a GPU fault, confirming the illegal write is from the GPU.

nvgpu: 17000000.gv11b gv11b_fb_mmu_fault_info_dump:294 [ERR] [MMU FAULT] ... fault addr: 0x739ecd000, ... access type: virt write

During our investigation, we noticed that even though vpr-carveout is disabled in the dtb, a dump of the running system’s device tree still shows:



        vpr-carveout {
                compatible = "nvidia,vpr-carveout";
                status = "okay";
                reg = <0x00 0xce000000 0x00 0x2a000000>;
                phandle = <0x9b>;
        };

I wonder if this is related to the issue. If it is, what is the right way to disable it? The original dtb under /boot/dtb/ (referenced by /boot/extlinux/extlinux.conf) already has it disabled:

TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      FDT /boot/dtb/kernel_tegra194-agx-cti-AGX101-JCB005-AVT-CSI2-4CAM.dtb
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 rootfstype=ext4 sdhci_tegra.en_boot_part_access=1 video=efifb:off

                vpr-carveout {
                        compatible = "nvidia,vpr-carveout";
                        status = "disabled";
                        phandle = <0x2a6>;
                };
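
One quick sanity check on the values above is to decode the `reg` property from the live device tree and test whether the address in the `mc-err` line falls inside that range. The sketch below assumes `reg` uses two address cells and two size cells (not confirmed in this thread) and that the `addr` printed by `mc-err` is a physical address in the same address space. `Region`, `from_reg`, and `contains` are hypothetical helper names.

```cpp
#include <cstdint>

// Hypothetical helper type: a physical address range [base, base + size).
struct Region {
    std::uint64_t base;
    std::uint64_t size;
};

// Combine the four 32-bit cells of a reg property into a base and a size.
// Assumption: #address-cells = <2> and #size-cells = <2>.
constexpr Region from_reg(std::uint32_t addr_hi, std::uint32_t addr_lo,
                          std::uint32_t size_hi, std::uint32_t size_lo) {
    return Region{(static_cast<std::uint64_t>(addr_hi) << 32) | addr_lo,
                  (static_cast<std::uint64_t>(size_hi) << 32) | size_lo};
}

// True if addr lies inside the half-open range described by r.
constexpr bool contains(const Region& r, std::uint64_t addr) {
    return addr >= r.base && addr - r.base < r.size;
}

// Values copied from the live device tree dump and the mc-err line.
constexpr Region kVprCarveout = from_reg(0x00, 0xce000000, 0x00, 0x2a000000);
constexpr std::uint64_t kMcErrAddr = 0x68570400;
```

Under those assumptions the carveout covers 0xce000000 up to (but not including) 0xf8000000, which is 672 MiB, and `contains(kVprCarveout, kMcErrAddr)` evaluates to false. That does not by itself rule the carveout out as a factor, but the reported address is not inside the range. The `fault addr` in the nvgpu line is a GPU virtual address, so it is not comparable with `reg` in this way.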

wsmlby · October 27, 2025, 2:27pm

Just to give more context:

When we run the test (allocate a single 55GB block of VRAM, then write to it), there is enough free memory and the allocation succeeds. The failure always comes after some of the writes have completed.
If we allocate less VRAM, it still happens, but not as reliably: even 1GB can trigger it sometimes.
We are running this test because we noticed a non-deterministic error on our live system; this is just a way to reproduce it.
Our theory is that, because of this carveout, some pages of the allocation belong to a memory area that the GPU cannot actually access.
The carveout was not present in R32, which we ran before the upgrade, and the problem is not triggered there either.

Attached is the code to reproduce:

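#include <cstdint>
#include <cstdio>
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Allocates sz_g GB of device memory, then writes 4 KB at the start of every 1 MB block.
int memtest(int sz_g) {
    // Host-side 4 KB test pattern.
    std::vector<uint8_t> data(4 * 1024);
    for (int i = 0; i < 4 * 1024; i++) {
        data[i] = i % 255;
    }

    uint8_t *d_data;
    size_t sz = static_cast<size_t>(sz_g) * 1024;  // number of 1 MB blocks
    cudaError_t err = cudaMalloc((void**)&d_data, sz * 1024 * 1024);
    if (err != cudaSuccess) {
        std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
        return -1;
    }
    for (size_t i = 0; i < sz; i++) {
        // Copy the 4 KB pattern to the start of block i.
        err = cudaMemcpy(d_data + i * 1024 * 1024, data.data(), 4 * 1024, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) {
            std::cerr << "cudaMemcpy failed at iteration " << i << ": " << cudaGetErrorString(err) << std::endl;
            // cudaFree(d_data);
            return -1;
        }
    }
    printf("cudaMemcpy success for %d GB\n", sz_g);
    cudaFree(d_data);
    return 0;
}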
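
The reproduction depends on doing its size arithmetic in a 64-bit type: 55 GB expressed in bytes does not fit in a 32-bit `int`, so the value has to be widened before the multiplications. A minimal sketch of that arithmetic, with no CUDA calls, is below. `block_count`, `total_bytes`, and `block_offset` are hypothetical names that mirror the expressions passed to `cudaMalloc` and `cudaMemcpy` in the code above.

```cpp
#include <cstdint>

constexpr std::uint64_t kBlockBytes = 1024ULL * 1024ULL;  // 1 MB stride between writes
constexpr std::uint64_t kWriteBytes = 4ULL * 1024ULL;     // 4 KB written per block

// Number of 1 MB blocks in an allocation of sz_g GB.
constexpr std::uint64_t block_count(int sz_g) {
    return static_cast<std::uint64_t>(sz_g) * 1024;
}

// Total allocation size in bytes, computed in 64 bits.
constexpr std::uint64_t total_bytes(int sz_g) {
    return block_count(sz_g) * kBlockBytes;
}

// Byte offset of the write performed at iteration i.
constexpr std::uint64_t block_offset(std::uint64_t i) {
    return i * kBlockBytes;
}
```

For the 55 GB case this gives 56320 blocks and 59055800320 bytes in total, and the last write ends well inside the allocation, so the test itself never writes past the end of the buffer it was given.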

proventusnova · October 27, 2025, 6:55pm

Hello @wsmlby,

At some point we were getting a VPR error, similar to the one you are experiencing, when using the AV1 HW encoder, although I don’t believe it is the same one:

[ +0.009097] tegra-mc 2c00000.memory-controller: nvencswr: secure write @0x00000003ffffff00: Route Sanity error ((null))
[ +0.019064] tegra-mc 2c00000.memory-controller: unknown: secure read @0x000000ffffffff00: EMEM address decode error (EMEM decode error)
[ +0.001501] tegra-mc 2c00000.memory-controller: nvencswr: secure write @0x00000003ffffff00: VPR violation ((null))

For what it's worth, it ended up being caused by an image resolution that was not 64-aligned, which caused memory management issues.
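
For readers hitting the encoder case described here, a common way to avoid unaligned dimensions is to round each one up to the next multiple of 64 before allocating buffers. This is a generic sketch, not the fix used in that project: `align_up` is a hypothetical helper, and whether width, height, or both need to be aligned is an assumption.

```cpp
#include <cstdint>

// Round value up to the next multiple of alignment.
// alignment must be non-zero; value + alignment must not overflow 32 bits.
constexpr std::uint32_t align_up(std::uint32_t value, std::uint32_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}
```

For example, `align_up(1080, 64)` gives 1088, while 1920 is already a multiple of 64 and is returned unchanged.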

Also, doing a bit of searching, I found this:

I was thinking that it might be worth trying something similar on your AGX Xavier?

Please keep us posted on test results, we might come up with some more test ideas down the line.

best regards,
Andrew
Embedded Software Engineer at ProventusNova

wsmlby · October 31, 2025, 8:09pm

We are not sure how to disable it. It is already disabled in the current dtb file:

                vpr-carveout {
                        compatible = "nvidia,vpr-carveout";
                        status = "disabled";
                        phandle = <0x9b>;
                };

but it gets enabled by NVIDIA’s UEFI.

ShaneCCC · November 3, 2025, 7:59am

You can delete the vpr-carveout{} to disable it.

Thanks

wsmlby · November 3, 2025, 7:06pm

Thanks. That successfully disabled the carveout, but I am still seeing the failure. Can you help? The reproduction code is really simple (provided above).

AastaLLL · November 19, 2025, 9:03am

Hi,

Could you check whether the same issue also occurs on the Xavier devkit with the latest r35.6.2?
If so, please share a reproducible sample so we can check it further internally.

Thanks.

system · December 16, 2025, 1:33am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
