Unhandled level 2 translation fault while using CUDA

jkeller · April 17, 2020, 4:08pm

I am experiencing an intermittent “unhandled level 2 translation fault” which results in a kernel backtrace within syslog. The log message starts out with:

unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000046

Followed by the usual register dump & stack info. The error occurs within the context of a process which is using CUDA-enabled TensorFlow on the TX2.

My question is:

Is this necessarily a CUDA/mmapi/gpu-driver level bug, or is this type of fault possibly the result of a misbehaving CUDA kernel? I am not familiar enough with low-level CUDA programming and the security model enforced by the CUDA runtime. Is it expected that a bug in CUDA “application” code could cause such a VM translation fault, or is this necessarily a driver fault (or, less likely, the hardware itself)?

The reason for my question is:

If this could be caused by TensorFlow or CUDA application code, I will look into upgrading the version of TensorFlow, or dig into the way it is using nvcc to compile CUDA code. If the CUDA application can't be at fault, I will look towards updating BSP-level driver code and/or kernel settings.

Thanks!

linuxdev · April 17, 2020, 4:55pm

I couldn’t tell you the actual original cause of the problem, but that is just an attempt to dereference a NULL pointer in the kernel. Chances are it has something to do with the VM, but that is just a guess.
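For reference, the esr value in the log line can be decoded by hand using the Armv8-A ESR_ELx field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in bit 6 and the fault status code in bits 5:0). The following is only a minimal sketch; the function name `decode_esr` and the lookup tables are made up for illustration, and only a few exception classes and fault types are covered.

```python
# Minimal ESR_ELx decoder sketch (names are illustrative, not from any library).
# Field positions follow the Armv8-A ESR_ELx layout.

EC_NAMES = {
    0x20: "Instruction Abort from a lower Exception level",
    0x21: "Instruction Abort taken without a change in Exception level",
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# For fault status codes below 0x10, bits 5:2 select the fault type
# and bits 1:0 give the translation table level.
FAULT_TYPES = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into the fields relevant to an abort."""
    ec = (esr >> 26) & 0x3F          # exception class
    iss = esr & 0x1FFFFFF            # instruction specific syndrome
    fsc = iss & 0x3F                 # fault status code
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": (esr >> 25) & 1,       # 1 = 32-bit instruction trapped
        "write": bool((iss >> 6) & 1),  # WnR, meaningful for data aborts
        "fault": FAULT_TYPES.get(fsc >> 2, "other") if fsc < 0x10 else "other",
        "level": fsc & 0x3 if fsc < 0x10 else None,
    }


if __name__ == "__main__":
    for key, value in decode_esr(0x92000046).items():
        print(key, "=", value)
```

Applied to 0x92000046, this reports exception class 0x24 (a data abort from a lower Exception level), a write access, and a level 2 translation fault. Taken together with the faulting address 0x00000000, that would suggest a write through a NULL pointer issued from user space (EL0) rather than from kernel code, although it does not say which library in the process made the access.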

jkeller · April 17, 2020, 5:34pm

@linuxdev that’s right. The question is whether even poorly written or malicious CUDA application code should be able to trigger such a translation fault within the kernel, or if this is necessarily a problem with the CUDA/GPU drivers themselves.

linuxdev · April 17, 2020, 5:37pm

That I could not answer.

AastaLLL · April 30, 2020, 7:08am

Hi,

‘unhandled level 2 translation fault’ can be reproduced with a user space app.

We have seen a similar error that was triggered by a NULL pointer.
The root cause in that case was a mismatch between the kernel driver and the CUDA toolkit.

Did you install all the packages from the same JetPack version?
Please also double-check that your TensorFlow package was built against the same JetPack version you are using.
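One rough way to gather the version information for such a comparison is sketched below. It assumes the L4T release string lives in /etc/nv_tegra_release and that the CUDA toolkit ships a /usr/local/cuda/version.txt file, which may not hold for every JetPack release. The helper names `parse_l4t_release`, `read_first_line`, and `report` are hypothetical.

```python
import re


def parse_l4t_release(line):
    """Extract (major, revision) from an L4T release line.

    Assumes a line shaped like '# R32 (release), REVISION: 4.2, ...'.
    Returns None if the line does not match that shape.
    """
    match = re.search(r"R(\d+) \(release\), REVISION: ([\d.]+)", line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip()
    except OSError:
        return None


def report():
    """Print the BSP and CUDA toolkit version strings found on this system."""
    # Both paths are assumptions about a typical Jetson install.
    l4t_line = read_first_line("/etc/nv_tegra_release")
    cuda_line = read_first_line("/usr/local/cuda/version.txt")
    print("L4T release line:", l4t_line)
    if l4t_line is not None:
        print("Parsed L4T version:", parse_l4t_release(l4t_line))
    print("CUDA toolkit version line:", cuda_line)
```

Calling `report()` on the TX2 prints the two version strings, which can then be compared against the L4T and CUDA versions listed for the JetPack release that the TensorFlow package was built for.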

Thanks.