Obvious Memory Access Error in 460.x driver (Patch provided)

Issue:
Mapping the I/O (physical) address 0xF8610000 causes a system reboot due to:
BUG: unable to handle kernel paging request at 00000000f8610000

Getting the error message requires setting up netconsole and running “sysctl kernel.panic_on_oops=0”.
There is no way to run bug-report.sh once the kernel hits the oops…

But the fix is quite straightforward:
By looking up “RIP: 0010:os_lookup_user_io_memory+0x3e1” from the Oops message with GDB and nvidia.ko,
I found the location of the issue: nvidia/os-mlock.c:59

I believe the original intention was to check whether the physical/bus addresses being mapped are contiguous.
But (*pte_array)[i] or (*pte_array)[i-1] dereferences a physical address as if it were a kernel-space virtual address…
Although the type of pte_array is “NvU64 **”, its elements are assigned with:
“pte_array[i] = (NvU64 *)(pfn << PAGE_SHIFT);”

Each element of pte_array is actually a PHYSICAL/BUS address (a 64-bit integer) cast to (NvU64 *) [pointer to a 64-bit integer], which SHOULD NOT be dereferenced directly…
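
To illustrate the point above, here is a minimal user-space sketch of a contiguity check that compares the stored addresses as integers instead of dereferencing them. It is not the driver source and not the contents of the attached patch; the names sketch_fill, sketch_is_contiguous, SKETCH_PAGE_SHIFT and the 4 KiB page size are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t NvU64;

/* Assumption: 4 KiB pages; the kernel's PAGE_SHIFT may differ. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical helper: stores addresses the same way the quoted
 * assignment does, i.e. an integer address cast to a pointer. */
static void sketch_fill(NvU64 **pte_array, size_t count, uint64_t first_pfn)
{
    assert(pte_array != NULL);
    for (size_t i = 0; i < count; i++)
        pte_array[i] = (NvU64 *)(uintptr_t)((first_pfn + i) << SKETCH_PAGE_SHIFT);
}

/* Hypothetical helper: returns 1 if every entry is exactly one page
 * after the previous one. The pointers are only converted back to
 * integers; they are never dereferenced. */
static int sketch_is_contiguous(NvU64 **pte_array, size_t count)
{
    assert(pte_array != NULL);
    for (size_t i = 1; i < count; i++) {
        uint64_t cur  = (uint64_t)(uintptr_t)pte_array[i];
        uint64_t prev = (uint64_t)(uintptr_t)pte_array[i - 1];
        if (cur != prev + SKETCH_PAGE_SIZE)
            return 0;
    }
    return 1;
}
```

The key design choice is that pte_array[i] is treated as a value, not as a location to read from, so no physical address is ever used as a virtual one.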

nvidia-460-fix-invalid-memory-access.patch (468 Bytes)

To put it more simply:
(*pte_array)[i] means: pte_array[0][i]
and
(*pte_array)[i - 1] means: pte_array[0][i - 1]

Each iteration of the loop assigns pte_array[i],
so comparing pte_array[0][i] and pte_array[0][i - 1] is an obvious error.
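
The indexing equivalence can be checked in user space with real backing memory. This is only an illustrative sketch; sketch_deref_form and sketch_index_form are hypothetical names, and the rows here are ordinary arrays, unlike the driver case where the stored values are physical addresses.

```c
#include <stdint.h>

typedef uint64_t NvU64;

/* The expression form used in the problematic check. */
static NvU64 sketch_deref_form(NvU64 **p, int i)
{
    return (*p)[i];
}

/* The equivalent explicit form: element i of the row p[0],
 * not the entry p[i] of the outer array. */
static NvU64 sketch_index_form(NvU64 **p, int i)
{
    return p[0][i];
}
```

Both functions read from the memory that p[0] points to, which is safe here only because p[0] is a valid array in this sketch.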

Thanks for reporting this. We’re tracking it in internal bug number 3280454. While the bug tracker isn’t public, you can use this number to refer to this issue in future correspondence.

I’ve noticed that some new versions of the Linux drivers were released a few days ago, but the issue is still there. Is there an expected time frame for addressing this issue?

Thanks.

Latest Production Branch Version: 460.73.01
Latest New Feature Branch Version: 465.24.02