I have several Ubuntu 16.04 Docker nodes running nvidia-docker. Each node runs several instances of an Emby container and uses a Quadro P4000 for NVENC media transcoding with ffmpeg (bundled with the Emby container).
I’ve observed that when I transcode enough concurrent streams to exhaust the GPU RAM (8GB), the host itself hangs and, in some cases, the NICs (Intel ixgbe) reset. A hard reset is then required to restore the host.
(I can supply an nvidia-bug-report log gathered at the time of the crash, if that helps.)
I realize that I’m over-subscribing my GPU RAM under these conditions, but my user load is unpredictable, and I’d prefer that the entire system not fail as a result. I’m a noob - is there anything I should/could be doing to limit the impact of oversubscribing RAM, such that attempting more transcodes than I have RAM to support will simply result in an error, but not a catastrophic system failure?
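To make the question concrete, here is a minimal sketch of the kind of admission check I could imagine running before each new transcode is allowed to start. It only reads free GPU memory via `nvidia-smi` and compares it to a budget. The function names, the 500 MiB per-stream estimate, and the 512 MiB headroom are assumptions for illustration, not measured values and not anything Emby or ffmpeg provides.

```python
import subprocess


def parse_free_mib(smi_output):
    """Turn nvidia-smi CSV output (one number per GPU) into a list of ints (MiB)."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_free_mib():
    """Ask nvidia-smi for the free memory of every GPU, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_free_mib(out)


def can_start_transcode(free_mib, per_stream_mib=500, headroom_mib=512):
    """Return True if one more stream fits while leaving some headroom.

    per_stream_mib and headroom_mib are hypothetical defaults; the real
    per-stream footprint depends on codec, resolution, and driver version.
    """
    return free_mib - per_stream_mib >= headroom_mib


if __name__ == "__main__":
    # Report, per GPU, whether another transcode would be admitted.
    for index, free in enumerate(query_free_mib()):
        verdict = "ok" if can_start_transcode(free) else "refuse"
        print("GPU {}: {} MiB free -> {}".format(index, free, verdict))
```

A check like this is not atomic: two transcodes starting at the same moment could both pass it, so it would reduce the chance of exhausting GPU RAM rather than rule it out.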