We are running into what looks like an NVIDIA driver bug. When running KataGo, an open-source replica of AlphaZero, sometimes when we terminate the application with Ctrl+C it hangs and cannot be killed (even with kill -9), and any other application that then tries to use the GPU (e.g. nvidia-smi) also hangs. Concurrently with this, we see a kernel OOPS reported in dmesg, with a NULL pointer dereference in the NVIDIA driver. For example:
We have replicated this problem on RTX A6000 and RTX A4000 GPUs, and on drivers 510.54 and 470.103. We are running Ubuntu 20.04, kernel 5.4.0-107-generic #121-Ubuntu SMP, with drivers installed from the Lambda Stack. We run our application in Docker, with nvidia-container-toolkit. I’ve attached an nvidia-bug-report.log.gz from one of our machines running 510.54, as well as a complete dmesg from one of the OOPSes.
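In case it helps, this is roughly how we check whether a machine has ended up in this state (just a sketch; the exact grep patterns are not load-bearing):
# Look for the NVIDIA driver OOPS / NULL pointer dereference in the kernel log:
dmesg | grep -iE "oops|null pointer|nvrm"
# Processes stuck in uninterruptible sleep (D state) explain why kill -9 has no effect:
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# nvidia-smi itself hangs on an affected machine, so bound it with a timeout:
timeout 10 nvidia-smi || echo "nvidia-smi did not return within 10s"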
We have replicated this issue on a system with GTX 1080 Ti and Asus motherboard. So it does not seem specific to the platform in any way. Driver is 510.54, kernel is 5.4.0-107-generic #121-Ubuntu SMP.
nvidia-bug-report hangs and I cannot reboot this server yet as it has a critical task still running, but I am attaching the partial output. nvidia-bug-report.log.gz (3.6 KB)
@AdamGleaveUCB
Please confirm if you are running the same application on all platforms.
If yes, please provide reliable repro steps so that I can try to reproduce the issue locally, which will help with debugging.
Unfortunately the issue is stochastic: it occurs sometimes, but not always, when we kill a running KataGo instance. It is more likely to occur the longer it’s been running.
I’ll look into whether I can at least automate this (e.g. running and killing it in a loop until it hangs), or find some combination of settings that triggers the issue more reliably.
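For reference, the kind of loop I have in mind is something like this (a rough sketch only, not a finished repro script; KATAGO_CMD is a placeholder for however KataGo gets launched):
for i in $(seq 1 100); do
    echo "*** Iteration $i ***"
    $KATAGO_CMD &            # placeholder: the real launch command goes here
    pid=$!
    sleep 60                 # let it build up GPU state before killing it
    kill -INT "$pid"         # same signal as Ctrl+C
    sleep 10
    if ! timeout 30 nvidia-smi > /dev/null; then
        echo "GPU appears hung after iteration $i"
        break
    fi
done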
Just wanted to check in to see whether you’ve been able to reproduce it with the new code. Do let us know if you have any trouble; happy to provide more details.
@AdamGleaveUCB
Thanks for sharing the code; however, I ran it on a notebook/system with a couple of GPUs connected and it failed as below:
root@oemqa-ThinkPad-P1-Gen-3:~/katago-driver-bug-repro# bash loop.sh
docker-compose version 1.25.0, build unknown
*** Iteration 0 ***
Starting Docker compose
Waiting for 45 seconds
…
Done waiting.
Trying to bring docker service down now.
If this hangs, then bug detected!
ERROR: The Compose file './compose/crash.yml' is invalid because:
services.selfplay.deploy.resources.reservations value Additional properties are not allowed ('devices' was unexpected)
services.selfplay.build contains unsupported option: 'target'
services.selfplay.volumes contains an invalid type, it should be a string
services.selfplay.volumes contains an invalid type, it should be a string
services.selfplay.volumes contains an invalid type, it should be a string
*** Iteration 1 ***
Starting Docker compose
Waiting for 45 seconds
…^C
root@oemqa-ThinkPad-P1-Gen-3:~/katago-driver-bug-repro#
Please confirm whether I need 7 GPUs to run the code successfully.
It would be great (and easier) to have code that can trigger the issue with only a couple of GPUs connected.
I have modified the code to run on fewer GPUs: you can now run bash loop.sh <n> [time], where n can be 2, 3 or 7 GPUs and time is the timeout (defaults to 60s; you may want to increase it if you don’t see GPU utilization occurring, as I’ve found the start-up time varies with how powerful the machine is). You’ll need to rebuild the Docker image to pick up the new configs if running on a system you’ve already run this on, e.g. docker-compose -f compose/crash2.yml --env-file compose/crash2.env build.
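For example, on a 2-GPU machine with a 90-second timeout (assuming the crash2 config corresponds to the 2-GPU case):
docker-compose -f compose/crash2.yml --env-file compose/crash2.env build
bash loop.sh 2 90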
Unfortunately, however, the issue is much harder to reproduce on fewer GPUs: I ran bash loop.sh 3 for 100 iterations without error, although the issue has occurred at least once with 3 GPUs in the wild. So if you can test it on a larger machine that would be best, but if that is significantly harder you could just leave bash loop.sh running in the background (you’ll want to increase the number of iterations from 100) and it should replicate given enough time.
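For example (assuming you have already raised the iteration limit inside loop.sh; the log file name is arbitrary):
nohup bash loop.sh 3 90 > loop-output.log 2>&1 &   # run unattended, capture output
tail -f loop-output.log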
I’d be happy to give you temporary access to one of our 8-GPU servers to replicate this, though I imagine you need internal infra to debug this properly.
I left it running for four days with three GPUs and was unfortunately unable to replicate this issue, so the number of GPUs does seem critical for this test case. I will look into whether I can modify the test case to be more reliable with fewer GPUs, but if the bug involves some race condition in the driver it may be much less likely to occur with a limited number of GPUs.
I tried on a system with 4 x T4 cards but could not reproduce the issue so far. Here is the output captured after running the script shared on GitHub.
If this hangs, then bug detected!
ERROR: .FileNotFoundError: [Errno 2] No such file or directory: './compose/crash4.yml'
*** Iteration 100 ***
Starting Docker compose
Waiting for 60 seconds
…
Done waiting.
Trying to bring docker service down now.
Thanks for the attempt to replicate. As the output shows, there’s no compose/crash4.yml file, so the test never ran. The script can’t run on an arbitrary number of GPUs, as we have to provide a different config file depending on the number of GPU devices. Sorry, this should have been documented more clearly.
I’ve added a 4-GPU config now. If running on the same machine again, you’ll need to rebuild the Docker image so it includes the new config, e.g.:
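i.e. the same commands as before, with crash4 in place of crash2:
# following the same crash<n>.yml / crash<n>.env naming pattern as the crash2 example above
docker-compose -f compose/crash4.yml --env-file compose/crash4.env build
bash loop.sh 4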
We did see this error occur once in the wild on a 4-GPU machine, so replication should be possible, but it seems to happen much less frequently than with more GPUs. So running it on an 8-GPU machine would still be preferable if you have access to one.
To check things are running correctly, you can look in the logs in bug-repro-logs/active/compose.{stdout,stderr}.
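For example, to follow them while the repro is running (any equivalent way of watching the files works):
tail -f bug-repro-logs/active/compose.stdout bug-repro-logs/active/compose.stderr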