Description
We’re trying to apply an Amazon Deep Learning AMI to an EMR cluster. When the NVIDIA driver install step runs, we get this error:
“ERROR: An NVIDIA kernel module ‘nvidia’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.”
We need a clean procedure to fix this error or work around it.
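In case it helps frame the ask, here is a rough diagnostic sketch of what we could run over SSH on an affected node before retrying the installer. This is not a verified procedure: the module names, the nvidia-persistenced service name, and the assumption that no other process needs the GPU at that point are all guesses on our side.

```bash
# Rough diagnostics on a cluster node before re-running the NVIDIA installer.
# Assumes SSH access to the node; module/service names may differ per driver version.

# 1. See which NVIDIA kernel modules are currently loaded.
lsmod | grep nvidia

# 2. See whether any process is still holding the GPU device files
#    (e.g. nvidia-persistenced, a leftover CUDA job, an X server).
sudo lsof /dev/nvidia* 2>/dev/null
nvidia-smi   # lists PIDs of GPU processes, if the currently installed driver still works

# 3. Stop the persistence daemon if it is running (service name is an assumption;
#    Amazon Linux AMI 2018.03 uses SysV init rather than systemd).
sudo service nvidia-persistenced stop 2>/dev/null || sudo killall nvidia-persistenced 2>/dev/null

# 4. Try to unload the modules in dependency order; if any rmmod fails with
#    "in use", something still references the driver and a node reboot may be
#    the only clean option, as the error message itself suggests.
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null

# 5. Retry the driver installer / bootstrap step only once lsmod shows
#    no nvidia modules loaded.
lsmod | grep nvidia || echo "no nvidia modules loaded; safe to retry installer"
```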
Environment
TensorRT Version: Not sure
GPU Type: Whatever is in the AWS GPU instance types we use (P3, P4, G3, G4 — i.e., V100, A100, M60, T4 respectively)
Nvidia Driver Version: Not sure (Amazon docs don’t include this info; see the version-check commands after this list)
CUDA Version: Not sure (Amazon docs don’t include this info)
CUDNN Version: Not sure (Amazon docs don’t include this info)
Operating System + Version: Amazon Linux AMI 2018.03
Python Version (if applicable): Python 3.7.3
TensorFlow Version (if applicable): 2.4.1
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): Not sure
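For the “Not sure” fields above, these are the commands we’d plan to run on a cluster node to fill them in. The paths are assumptions on our part (they vary with how CUDA and cuDNN are laid out on the AMI), so treat this as a sketch rather than a definitive check.

```bash
# Commands to fill in the unknown versions above.
# Paths are assumptions; they depend on how CUDA/cuDNN were packaged on the AMI.

# NVIDIA driver version (and the CUDA version the driver supports, in the header).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi

# CUDA toolkit version (older toolkits ship version.txt, newer ones version.json).
nvcc --version 2>/dev/null
cat /usr/local/cuda/version.txt 2>/dev/null
cat /usr/local/cuda/version.json 2>/dev/null

# cuDNN version (header name differs: cudnn_version.h for cuDNN 8+, cudnn.h for 7.x).
grep -m1 -A2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn_version.h 2>/dev/null
grep -m1 -A2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h 2>/dev/null
```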
Relevant Files
A minimal bug report scenario is described here: GitHub - dgoldenberg-audiomack/nvidia-issue-1
Steps To Reproduce
The same minimal bug report scenario covers the reproduction: GitHub - dgoldenberg-audiomack/nvidia-issue-1
It includes the setup steps, the run steps, and a log file snippet with the error.