Unhandled level 2 translation fault while using CUDA

I am experiencing an intermittent “unhandled level 2 translation fault” which results in a kernel backtrace within syslog. The log message starts out with:

unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000046

Followed by the usual register dump & stack info. The error occurs within the context of a process which is using CUDA-enabled TensorFlow on the TX2.

My question is:

Is this necessarily a CUDA/mmapi/GPU-driver-level bug, or could this type of fault be the result of a misbehaving CUDA kernel? I am not familiar enough with low-level CUDA programming or the memory-protection model enforced by the CUDA runtime to judge. Is it expected that a bug in CUDA “application” code could cause such a VM translation fault, or is this necessarily a driver fault (or, less likely, the hardware itself)?
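To make the question concrete, here is the sort of application-level bug I have in mind. My (possibly wrong) understanding is that a fault like this inside a device kernel is normally caught and reported back to user space as a CUDA runtime error rather than crashing the host kernel — this is just a minimal sketch compiled with nvcc, not something I have confirmed reproduces my problem:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately writes through an invalid (NULL) device pointer.
__global__ void bad_kernel(int *p) {
    *p = 42;  // illegal address on the device
}

int main() {
    bad_kernel<<<1, 1>>>(nullptr);

    // My assumption: the device-side fault surfaces here as a CUDA
    // error code (e.g. an "illegal memory access" class of error),
    // not as a host-side kernel translation fault.
    cudaError_t err = cudaDeviceSynchronize();
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}
```

If that assumption holds, then application code should not be able to oops the host kernel, and the fault in syslog would point at the driver side instead.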

The reason for my question is:

If this could be caused by TensorFlow or CUDA application code, I will look into upgrading the version of TensorFlow, or dig into the way it uses nvcc to compile CUDA code. If the CUDA application can't be at fault, I will look towards updating BSP-level driver code and/or kernel settings.

Thanks!

I couldn’t tell you the actual original cause of the problem, but the message itself indicates an attempt to dereference a NULL pointer in kernel space. Chances are it has something to do with the VM subsystem, but that is just a guess.

@linuxdev that’s right. The question is whether even poorly written or malicious CUDA application code should be able to trigger such a translation fault within the kernel, or if this is necessarily a problem with the CUDA/GPU drivers themselves.

That I could not answer.

Hi,

An “unhandled level 2 translation fault” can be reproduced with a user-space app.

We have seen a similar error that also originated from a NULL pointer.
The root cause was that the kernel driver and the CUDA toolkit didn’t match.

Did you install all the packages from the same JetPack version?
Please also double-check that your TensorFlow package was built against the same JetPack version as the one you are using.
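One way to cross-check the versions is a few commands like the following. This is a sketch assuming a standard L4T/JetPack install; the `/etc/nv_tegra_release` path and `nvcc` location may differ on a customized BSP, so each command falls back to a message if it is missing:

```shell
# L4T/BSP release string (present on standard Jetson images).
cat /etc/nv_tegra_release 2>/dev/null || echo "no /etc/nv_tegra_release found"

# Installed CUDA toolkit version.
nvcc --version 2>/dev/null \
  || /usr/local/cuda/bin/nvcc --version 2>/dev/null \
  || echo "nvcc not found"

# TensorFlow version actually importable from Python.
python3 -c "import tensorflow as tf; print(tf.__version__)" 2>/dev/null \
  || echo "tensorflow not importable"
```

Comparing the L4T release against the JetPack version your TensorFlow wheel was built for should confirm whether the driver/toolkit mismatch described above applies.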

Thanks.