What are the pinned memory limitations on CUDA for WSL2?

I understand there are limits on the amount of pinned memory when running under WSL2. I would like to know the details of these limits. This is for an Ampere card.

The maximum available amount of pinned host memory depends on internal details of the operating system, not CUDA. cudaHostAlloc is merely a thin wrapper around OS API calls. To date I am not aware of any formula that would let one determine the maximum amount of available pinned memory up front, for any operating system supported by CUDA. You might want to inquire with the vendor of WSL2.

I will note that pinning memory to physical addresses is somewhat anathema to virtual memory management which underpins all modern operating systems. The idea behind providing pinned allocations is to support the creation of buffers of modest size (in the MB range) suitable for DMA, in particular for use by device drivers. Pinning large portions of system memory is not part of OS design philosophy and may cause issues, such as slowing down system memory allocations.

Thanks. In my case, pinned buffers would vary in size from 10 to 40 MB. This is for an image compression application. I will inquire further with WSL2 folks.

Pinning a few buffers of that size should not be a problem in my experience. Are you observing failed allocations?
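Since there is no formula for the limit, a quick empirical check is the easiest way to answer that question. Below is a minimal sketch, assuming four buffers of 40 MB each (the upper end of the sizes mentioned above; the buffer count and the helper name probe_pinned are made up for illustration). It only reports whether cudaHostAlloc succeeds on the platform at hand and says nothing about what the actual limit is.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: try to allocate `count` pinned buffers of `bytes` each
// and return how many allocations succeeded. All buffers are freed afterwards.
static int probe_pinned(size_t bytes, int count)
{
    const int kMax = 16;                 // arbitrary cap for this sketch
    void *buf[kMax] = {nullptr};
    if (count > kMax) count = kMax;

    int ok = 0;
    for (int i = 0; i < count; ++i) {
        cudaError_t err = cudaHostAlloc(&buf[i], bytes, cudaHostAllocDefault);
        if (err != cudaSuccess) {
            std::printf("buffer %d (%zu bytes) failed: %s\n",
                        i, bytes, cudaGetErrorString(err));
            break;
        }
        ++ok;
    }
    for (int i = 0; i < ok; ++i) {
        cudaFreeHost(buf[i]);
    }
    return ok;
}

int main()
{
    const size_t bytes = 40u << 20;      // 40 MB per buffer
    const int count = 4;                 // assumed number of buffers in flight
    const int ok = probe_pinned(bytes, count);
    std::printf("%d of %d pinned buffers allocated\n", ok, count);
    return ok == count ? 0 : 1;
}
```

Running the same binary under WSL2, native Windows, and bare-metal Linux would show whether any of the platforms rejects allocations of this size.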

Thanks, I was just doing some due diligence before trying out WSL2.
As it turns out, everything works fine on WSL2 for 10 MB input files.
Performance is equal to native Windows performance, although considerably slower than Linux bare-metal performance on the exact same hardware.

Are you referring to the performance of the CUDA kernels themselves, or application level performance? I am assuming the latter, because kernel run-time should not be affected by the host platform. I am also assuming this is a controlled experiment, that is, you are booting either OS on the physically identical machine and the OS is the only variable that changes.

Before you jump into profiling (which you definitely would want to do), it is probably a good idea to review compiler switches and any configuration settings for the app to make sure they are in fact identical across WSL2 (which runs an Ubuntu distribution by default, as best I know) and your other Linux distro.

Thanks - yes this is the exact same laptop booting in either Fedora Linux or Windows 11.
The software is exactly the same. And yes, the timing I mention is application level, not kernel level.
This application is an encoder, so I measure perf in ms per frame. On Fedora, I get 11 ms per frame, while on both Windows and WSL2, I get 19 ms per frame. I was expecting the WSL time to be lower.

There has to be a rational explanation for this timing discrepancy that should be identifiable.

If this were my program, the first thing I would do is to use the CUDA profiler to confirm that kernel run times are in fact the same (plus / minus a few percent measurement “noise”) on both OSes, as I would expect. If that is the case, the difference must be in host code, which can be timed and profiled.
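If the profiler is not at hand, a rough first split between GPU time and host time can be had with CUDA events and a host clock. This is a minimal sketch, not the encoder's actual code: the kernel scale and the problem size are placeholders, and the event pair measures only the span between the two records in the stream, while the host clock covers launch, execution, and synchronization as seen by the CPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for real work.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 22;               // arbitrary problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization is not included in the timing.
    scale<<<blocks, threads>>>(d, n, 2.0f);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto host_begin = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until the kernel has finished
    auto host_end = std::chrono::steady_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    const double host_ms =
        std::chrono::duration<double, std::milli>(host_end - host_begin).count();

    std::printf("GPU (events): %.3f ms, host wall clock: %.3f ms\n",
                gpu_ms, host_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

If the event time matches across operating systems while the wall-clock time does not, that points at host-side overhead rather than the kernels.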

I do not know how WSL2 is layered on top of Windows, but the default driver model under Windows (WDDM) suffers from some inefficiencies because the OS wants to maintain maximum control over the GPU. This affects kernel launch overhead in particular, which is higher than on Linux or with the alternate TCC driver. However, the CUDA driver tries to mitigate this overhead, and the difference should cause a performance degradation in the single-digit percent range, rather than a factor of almost 2x. Differences also exist in the speed of memory allocation and de-allocation (with WDDM, the OS is in charge of GPU memory allocation), but again, that should not result in such a massive difference.
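To get a feel for the launch overhead on each platform, one can time a long run of back-to-back launches of an empty kernel. The sketch below is an approximation under stated assumptions: the launch count of 10000 is arbitrary, and the result is an average cost per launch at saturation, which the driver may influence by batching launches, so it is not the latency of a single isolated launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that does no work, so the measured time is dominated by launch cost.
__global__ void empty_kernel() {}

int main()
{
    // Warm-up launch so context creation is excluded from the measurement.
    empty_kernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    const int launches = 10000;          // arbitrary repetition count
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 1>>>();
    }
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;
    auto t1 = std::chrono::steady_clock::now();

    const double total_us =
        std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("average cost per launch: %.2f us\n", total_us / launches);
    return 0;
}
```

Comparing the number from Fedora with the one from Windows and WSL2 on the same laptop would indicate how much of the per-frame gap could plausibly come from launch overhead, given the number of kernel launches per frame.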

You will need to do some digging to get to the bottom of this. Best of luck.
