[CUDA][WSL2][MPS] CUDA OOM with >14 MPI ranks under WSL2; MPS not starting

Hello,

I am running an MPI-based CUDA application under WSL2 (Ubuntu) with the following setup:

  • GPU: NVIDIA RTX A6000 (48GB)

  • Driver (WSL): 535.247.01 (nvidia-smi in WSL reports Driver Version 531.14, CUDA Version 12.1)

  • Driver (Windows host): 531.4

  • CUDA Toolkit (WSL): 11.6 (nvcc release 11.6, V11.6.55)

  • CUDA Toolkit (Windows): 12.1 (nvcc release 12.1, V12.1.105)

  • MPI: MPICH


Problem

  • Works fine with 14 MPI ranks.

  • With 15 ranks or more, CUDA reports out of memory (OOM) even though nvidia-smi shows usage far below 48GB (a minimal reproducer sketch is below).
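
For concreteness, the failure reduces to each rank creating a context on the shared GPU and making one allocation. A minimal sketch (the 256 MB per-rank size and the build line are placeholders, not my real application's allocation pattern):

```
// oom_repro.cu
// build (MPI paths may differ): nvcc -ccbin mpicxx oom_repro.cu -o oom_repro
// run: mpiexec -n 15 ./oom_repro
#include <mpi.h>
#include <cstdio>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                 // all ranks share the single A6000

    void* buf = nullptr;
    size_t bytes = 256ull << 20;      // 256 MB per rank -- placeholder size
    cudaError_t err = cudaMalloc(&buf, bytes);
    std::printf("rank %2d: cudaMalloc(%zu MB) -> %s\n",
                rank, bytes >> 20, cudaGetErrorString(err));

    MPI_Finalize();
    return 0;
}
```

On my machine this pattern succeeds at -n 14 and starts failing at -n 15.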

Observations

  • With 20 ranks launched, OOM still occurs even when only half of the ranks actually allocate memory.

  • When splitting into two groups:

    • First 10 ranks allocate memory → GPU memory usage grows modestly (from ~5412 MB to ~8387 MB).

    • Second 10 ranks then attempt allocation → immediate OOM.

  • This suggests per-context driver overhead or a WSL2 limitation, not physical memory exhaustion (see the cudaMemGetInfo sketch below).
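
To separate "the driver has already reserved the memory" from "the memory is genuinely exhausted", each rank can ask the driver how much it considers free immediately after context creation, before any explicit allocation. A sketch along those lines:

```
// meminfo.cu -- print what the driver reports as free/total, per rank
#include <mpi.h>
#include <cstdio>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);
    size_t free_b = 0, total_b = 0;
    // cudaMemGetInfo triggers context creation on first use, so the drop
    // in "free" as rank count grows also measures per-context overhead.
    cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
    std::printf("rank %2d: free %zu MB / total %zu MB (%s)\n",
                rank, free_b >> 20, total_b >> 20, cudaGetErrorString(err));

    MPI_Finalize();
    return 0;
}
```

If "free" drops by a large fixed amount per rank even though no rank has called cudaMalloc yet, that would point at per-context reservation rather than my allocations.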

MPS issue

  • Tried enabling MPS in WSL2:

    • Started nvidia-cuda-mps-control -d → the control socket appears.

    • No nvidia-cuda-mps-server process ever starts.

    • server.log is empty.

    • nvidia-smi never shows “Compute MPS Active”.

  • It seems MPS cannot start properly under WSL2 (the startup sequence I used is sketched below).
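
For reference, the sequence I used to bring up MPS is folded into the comments of the small probe below (the pipe/log directories are example paths, not requirements). The probe just forces context creation, which should make the daemon spawn an nvidia-cuda-mps-server process visible to ps and to get_server_list:

```
// mps_probe.cu -- force one CUDA context; with a working MPS daemon, an
// nvidia-cuda-mps-server process should appear while this runs.
//
// Shell steps beforehand (standard MPS setup; directories are examples):
//   export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
//   export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
//   nvidia-cuda-mps-control -d
//   ./mps_probe &
//   echo get_server_list | nvidia-cuda-mps-control   # should print a server PID
#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

int main() {
    cudaFree(0);   // no-op allocation call that forces context creation
    std::printf("context created; check for nvidia-cuda-mps-server now\n");
    sleep(30);     // hold the context so any spawned server stays visible
    return 0;
}
```

In my case get_server_list prints nothing and no server process ever appears.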


Questions

  1. Why does OOM occur with >14 ranks despite unused GPU memory?

  2. Is this a known CUDA/WSL2 limitation (per-context memory reservation, driver behavior, etc.)?

  3. Is there a reliable way to allow more MPI ranks to share the GPU under WSL2 (e.g., via MPS)?

  4. How can I confirm whether MPS is supported/running in WSL2?

Any insights would be very helpful. Thank you!