Recently we upgraded our AGX Xavier(64GB version) to L4T 35.3.1 (JetPack 5.1.1) on CTI rogue, and we noticed that a simple code of allocate 55GB of vram via cudaMalloc, then try to write 4k to each of the 1MB block(just as a quick way to see where we see error), it will 100% time trigger an error in the application(wrong argument), with kernel logs like this:
1 Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: (255) csw_nvl4w: EMEM address decode error 2 Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: status = 0x200100bb; addr = 0x68570400; hi_adr_reg=0x0 3 Oct 09 12:30:21 argus-026-040-0666-01 kernel: mc-err: secure: no, access-type: write
This is accompanied by a GPU fault, confirming the illegal write is from the GPU.
I wonder if this is related to this issue. And if it is, what is the right way to disable it? Since the original dtb under /boot/dtb/ (referenced by /boot/extlinux/extlinux.conf) have it disabled
TIMEOUT 30
DEFAULT primary
MENU TITLE L4T boot options
LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
FDT /boot/dtb/kernel_tegra194-agx-cti-AGX101-JCB005-AVT-CSI2-4CAM.dtb
INITRD /boot/initrd
APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 rootfstype=ext4 sdhci_tegra.en_boot_part_access=1 video=efifb:off
when we run the test(allocate a single 55GB vram then write), there was enough free mem, and , the allocation was successful. The failure always come after some writing.
if we allocate smaller vram, it will still happen but not as reliably: even only 1GB can trigger it sometimes.
the reason we are doing this test is because we noticed a non-deterministic error happening to our live system. this is just a way to reproduce it.
our theory is that because of this carveout, some page of the allocation we get belongs to some memory area that is not actually accessible by the GPU.
that carveout is not present in R32 before we upgrade and the problem is not triggered there either.
Attached the code to reproduce
int memtest(int sz_g) {
uint8_t *data = new uint8_t[4 * 1024];
for (int i = 0; i < 4 * 1024; i++) {
data[i] = i % 255;
}
uint8_t *d_data;
size_t sz = sz_g * 1024;
cudaError_t err = cudaMalloc((void**)&d_data, sz * 1024 * 1024);
if (err != cudaSuccess) {
std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
return -1;
}
for (size_t i = 0; i < sz; i++) {
err = cudaMemcpy(d_data + i * 1024 * 1024, data, 4 * 1024, cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "cudaMemcpy failed at iteration " << i << ": " << cudaGetErrorString(err) << std::endl;
// cudaFree(d_data);
return -1;
}
}
printf("cudaMemcpy success for %d GB\n", sz_g);
return 0;
}
At some point we were getting a VRP error, similar to the one you are experiencing, when using the AV1 HW encoder. Although I don’t believe is the same one:
Thanks. That successfully disabled the carveout but I am still seeing the failure. Can you help? the reproduction code is really simple(provided above).