CUDA_ERROR_NOT_PERMITTED during video decoding/encoding with MPS

We have several applications that decode and encode video by creating and destroying CUDA contexts for that purpose.

Seemingly at random, they crash with CUDA_ERROR_NOT_PERMITTED while assigning a CUDA context (created with pycuda) to the codec context (we are using pyAV):

[AVHWDeviceContext @ 0x7f633f903c80] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_NOT_PERMITTED: operation not permitted

The error surfaces at the following point in our code (partial traceback):

stream.codec_context.cuda_ctx = cuda_ctx.context
File "av/video/codeccontext.pyx", line 266, in
File "av/video/codeccontext.pyx", line 229, in
RuntimeError: No hw_device_ctx nor hw_frames_ctx specified
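For context, the pattern is: each worker creates a fresh CUDA context, attaches it to the codec context, and destroys it when the job finishes. As a stopgap while debugging, we have considered retrying the context creation, since the failure appears transient. A minimal sketch of that retry logic, using a stand-in exception class (the concrete pycuda exception type is not shown in the traceback above, so the names here are illustrative):

```python
import time

# Stand-in for the error raised when context creation fails with
# CUDA_ERROR_NOT_PERMITTED; the real pycuda exception class differs.
class ContextCreationError(RuntimeError):
    pass

def create_context_with_retries(make_context, attempts=3, delay=0.5):
    """Call make_context() up to `attempts` times, sleeping `delay`
    seconds between failures, and re-raise the last error if all fail."""
    last_exc = None
    for _ in range(attempts):
        try:
            return make_context()
        except ContextCreationError as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

In practice `make_context` would be a small wrapper around the pycuda context creation that the codec context is later attached to.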

We are using MPS in Kubernetes (with hostIPC correctly enabled on the containers' pods). We are not exceeding the MPS context limits. We have been unable to reproduce the error by:

  • Creating and destroying contexts repeatedly.
  • Assigning a low CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to the containers.
  • Forcing OOM by creating sufficient encoding/decoding sessions.
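For reference, the MPS-related parts of our pod spec look roughly like this (simplified; the container name, image, and pipe directory path here are illustrative):

```yaml
apiVersion: v1
kind: Pod
spec:
  hostIPC: true            # so clients in the pod can reach the MPS daemon
  containers:
  - name: transcoder       # illustrative name
    image: our-app:latest  # illustrative image
    env:
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "25"          # the low per-client SM limit we experimented with
    - name: CUDA_MPS_PIPE_DIRECTORY
      value: /tmp/nvidia-mps
    volumeMounts:
    - name: mps-pipe
      mountPath: /tmp/nvidia-mps
  volumes:
  - name: mps-pipe
    hostPath:
      path: /tmp/nvidia-mps
```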

Logs in MPS show normal usage:

[2023-06-13 21:28:21.593 Other     7] Volta MPS Server: Received new client request
[2023-06-13 21:28:21.593 Other     7] MPS Server: worker created
[2023-06-13 21:28:21.593 Other     7] Volta MPS: Creating worker thread
[2023-06-13 21:28:21.593 Other     7] Volta MPS: Device NVIDIA GeForce RTX 3080 (uuid 0xd47fc320-0x88813162-0x9dc394c4-0x67c70424) is associated
[2023-06-13 21:28:21.734 Other     7] Receive command failed, assuming client exit
[2023-06-13 21:28:21.735 Other     7] Volta MPS: Client disconnected. Number of active client contexts is 6.
[2023-06-13 21:28:22.937 Control     1] Accepting connection...
[2023-06-13 21:28:22.937 Control     1] NEW CLIENT 0 from user 0: Server already exists

What could be causing this error?