We have several applications that decode and encode video, creating and destroying CUDA contexts for that purpose.
Seemingly at random, they crash with CUDA_ERROR_NOT_PERMITTED while assigning a CUDA context (created with pycuda) to the codec context (we are using PyAV):
```
[AVHWDeviceContext @ 0x7f633f903c80] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_NOT_PERMITTED: operation not permitted
```
with the following partial stack trace:
```
...
    stream.codec_context.cuda_ctx = cuda_ctx.context
  File "av/video/codeccontext.pyx", line 266, in av.video.codeccontext.VideoCodecContext.cuda_ctx.__set__
  File "av/video/codeccontext.pyx", line 229, in av.video.codeccontext.VideoCodecContext.get_cuda_hwctx
RuntimeError: No hw_device_ctx nor hw_frames_ctx specified
```
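For context, the failing pattern is roughly the sketch below. The input file and device index are placeholders, and the `cuda_ctx` setter comes from the PyAV build referenced in the traceback above:

```python
import av                         # PyAV build with the cuda_ctx setter from the traceback
import pycuda.driver as cuda

cuda.init()
device = cuda.Device(0)           # device index is illustrative
cuda_ctx = device.make_context()  # creates and pushes a new CUDA context

try:
    container = av.open("input.mp4")          # placeholder input file
    stream = container.streams.video[0]
    # Hand the pycuda context to the codec context; internally this builds
    # an AVHWDeviceContext, which is where CUDA_ERROR_NOT_PERMITTED surfaces.
    stream.codec_context.cuda_ctx = cuda_ctx
    for frame in container.decode(stream):
        pass                                  # process frames here
    container.close()
finally:
    cuda_ctx.pop()                # make the context non-current...
    cuda_ctx.detach()             # ...and release our reference to it
```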
We are using MPS in Kubernetes (with `hostIPC` correctly enabled for the pods). We are not exceeding the MPS client context limits. We have been unable to reproduce the error by:
- Creating and destroying contexts repeatedly (see the sketch after this list).
- Assigning a low `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` to the containers.
- Forcing OOM by creating enough encoding/decoding sessions.
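For reference, the first reproduction attempt churned contexts along these lines (a minimal sketch; the iteration count is arbitrary):

```python
import pycuda.driver as cuda

cuda.init()
device = cuda.Device(0)

# Repeatedly create and destroy CUDA contexts, mimicking the
# application's lifecycle; this never triggered the error under MPS.
for _ in range(10_000):
    ctx = device.make_context()   # create and push a context
    ctx.pop()                     # pop it off the context stack
    ctx.detach()                  # drop the reference so it is destroyed
```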
The MPS logs show normal usage:
```
[2023-06-13 21:28:21.593 Other 7] Volta MPS Server: Received new client request
[2023-06-13 21:28:21.593 Other 7] MPS Server: worker created
[2023-06-13 21:28:21.593 Other 7] Volta MPS: Creating worker thread
[2023-06-13 21:28:21.593 Other 7] Volta MPS: Device NVIDIA GeForce RTX 3080 (uuid 0xd47fc320-0x88813162-0x9dc394c4-0x67c70424) is associated
[2023-06-13 21:28:21.734 Other 7] Receive command failed, assuming client exit
[2023-06-13 21:28:21.735 Other 7] Volta MPS: Client disconnected. Number of active client contexts is 6.
[2023-06-13 21:28:22.937 Control 1] Accepting connection...
[2023-06-13 21:28:22.937 Control 1] NEW CLIENT 0 from user 0: Server already exists
```
What could be causing this error?