CUDA Out of Memory on GKE with Time-Sharing

Hello,

We are experiencing persistent CUDA out-of-memory errors on a GKE node equipped with an NVIDIA T4 GPU configured for time-sharing. We first raised this issue with Google Cloud Support, who suggested we consult the NVIDIA forums since it may be a driver-level problem.

The core issue is that we receive out-of-memory (OOM) errors when creating CUDA contexts, even though our monitoring shows that only about 30% of the GPU’s VRAM is in use.


Environment

  • Cloud Platform: Google Kubernetes Engine (GKE)

  • GPU: 1x NVIDIA Tesla T4 (16 GB VRAM)

  • GPU Sharing Strategy: GKE Time-Sharing (with max-shared-clients-per-gpu=48)

  • NVIDIA Driver Version: 570.133.20 (also confirmed the same issue on 535.247.01)

  • Node OS Image: Container-Optimized OS cos-117-18613-263-14


Problem Description

Our service provides screen recording for video conferencing applications like Google Meet. We run multiple recording sessions in parallel on a single T4 GPU using GKE’s time-sharing feature.

The out-of-memory error consistently appears when we scale our application to around 15 pods on a single node. Each pod runs a screen recording process using the gpu-screen-recorder tool, which relies on NVENC.


The Contradiction: OOM Errors vs. Monitoring Data

While the application and kernel logs report an out-of-memory condition, all our monitoring tools indicate ample available resources.

  • Google Cloud Monitoring: Reports GPU Memory Usage at only ~30% when the errors occur.

  • DCGM Exporter: Metrics for SM utilization and memory usage also show low consumption, with plenty of capacity available.

This suggests the OOM error is caused not by a lack of VRAM capacity itself, but by some other resource limit.
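To rule out a blind spot in the aggregate dashboards, we also want to inspect per-process GPU memory directly on the node. Below is a minimal sketch of that check (using Python and `subprocess` is just our choice; running the same `nvidia-smi` query from a shell is equivalent):

```python
# Sketch: list per-process GPU memory on the node, since aggregate
# dashboards (Cloud Monitoring, DCGM) may not surface per-process or
# per-context allocations individually.
import shutil
import subprocess

def per_process_gpu_memory():
    """Return nvidia-smi's per-compute-process memory report as text."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not available on this machine"
    proc = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv"],
        capture_output=True, text=True,
    )
    # Fall back to stderr so a driver error is still visible in the output.
    return proc.stdout or proc.stderr

if __name__ == "__main__":
    print(per_process_gpu_memory())
```

If the per-process numbers add up to noticeably more than what the dashboards report, that would point at overhead the monitoring does not attribute to any one pod.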


Error Logs

We have captured the following error messages:

  1. Kernel Log (dmesg on the GKE node):

    [ 361.456869] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
    
    
  2. Application Error Log (from within the container):

    [screen recorder] gsr error: gsr_cuda_load failed: unable to create CUDA context, error: out of memory (result: 2)
    gsr error: gsr_get_supported_video_codecs_nvenc: failed to load cuda
    Error: failed to query for supported video codecs
    
    

    The key failure is the inability to create a new CUDA context.
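To make the failure reproducible outside our recording stack, we use a minimal script that creates CUDA driver contexts in a loop until creation fails. This is a sketch under our assumptions: it loads `libcuda` via ctypes so no CUDA toolkit is needed, and `count_creatable_contexts` is a name we made up for illustration.

```python
# Sketch: count how many CUDA driver contexts can be created before
# cuCtxCreate fails (CUDA_ERROR_OUT_OF_MEMORY is result code 2,
# matching the "result: 2" in our application log).
import ctypes
import ctypes.util

def count_creatable_contexts(max_contexts=64):
    """Create driver contexts until failure; return (count, reason)."""
    name = ctypes.util.find_library("cuda")
    if name is None:
        return 0, "libcuda not found (not a GPU machine)"
    cuda = ctypes.CDLL(name)
    if cuda.cuInit(0) != 0:
        return 0, "cuInit failed"
    dev = ctypes.c_int()
    cuda.cuDeviceGet(ctypes.byref(dev), 0)
    created = 0
    for _ in range(max_contexts):
        ctx = ctypes.c_void_p()
        rc = cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev)
        if rc != 0:
            return created, f"cuCtxCreate failed with result {rc}"
        created += 1  # contexts intentionally leaked so they stay resident
    return created, "reached max_contexts without failure"

if __name__ == "__main__":
    n, why = count_creatable_contexts()
    print(f"created {n} contexts; stopped because: {why}")
```

Running this inside a pod (or several pods in parallel) should show whether context creation fails at a consistent count well before VRAM is exhausted.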


Our Questions

  1. Why would the NVIDIA driver report an Out of Memory error when VRAM usage is only at 30%?

  2. Is there a hard limit on the number of concurrent CUDA contexts that can be created in a time-sharing configuration, which is separate from the total available VRAM?

  3. Could this be a potential bug in the driver’s memory management or resource scheduling when used with GKE’s time-sharing scheduler?

Any insights or suggestions would be greatly appreciated. Thank you.