I am experiencing an intermittent “unhandled level 2 translation fault” which results in a kernel backtrace within syslog. The log message starts out with:
unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000046
Followed by the usual register dump & stack info. The error occurs within the context of a process which is using CUDA-enabled TensorFlow on the TX2.
My question is:
Is this necessarily a CUDA/mmapi/gpu-driver level bug, or is this type of fault possibly the result of a misbehaving CUDA kernel? I am not familiar enough with low-level CUDA programming and the security model enforced by the CUDA runtime. Is it expected that a bug in CUDA “application” code could cause such a VM translation fault, or is this necessarily a driver fault (or less-likely, hardware itself)?
The reason for my question is:
If this could be caused by tensorflow or cuda application code, I will look in to upgrading the version of tensorflow, or dig in to the way it is using nvcc to compile CUDA code. If it cant be the CUDA application at fault, I will look towards updating BSP level driver code and/or kernel settings.