Multi GPU, Windows 10 pagefile and global memory issues

cryptogone.dev · July 5, 2018, 2:46pm

Hello,

Let’s consider a system with 8x GTX 1080 Ti GPUs, running Windows 10.

For each GPU I allocate a big buffer of 9GB via cudaMalloc() (so global memory), let’s call it ‘workBuffer’

Each GPU then runs a kernel (1 block, 32 threads) that reads from and writes to workBuffer and produces 2048 bytes of output, which the CPU then processes.

The CPU never actually needs to read or write workBuffer, only the GPU does.

So given 8x 1080ti, this uses a total of 8 * 9 = 72GB of global memory.

On Windows 10, I am forced to increase the virtual memory size to ~80GB, otherwise I get an out-of-memory error when calling cudaMalloc().
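For reference, the allocation pattern described above might look roughly like the sketch below. The names `kWorkBufferBytes`, `totalRequestedBytes` and `allocateWorkBuffers` are hypothetical and not from the original program, and 9GB is taken here as 9 GiB. Note that cudaMalloc() reports failure through its return code (cudaErrorMemoryAllocation) rather than a C++ exception, so the sketch checks every call.

```cuda
#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// 9 GiB per device, as in the post (hypothetical constant name).
constexpr size_t kWorkBufferBytes = 9ull << 30;

// Total device memory requested across `deviceCount` GPUs.
constexpr size_t totalRequestedBytes(int deviceCount) {
    return static_cast<size_t>(deviceCount) * kWorkBufferBytes;
}

// Allocates one work buffer on every visible GPU.
// Returns the first CUDA error encountered, or cudaSuccess.
// Buffers allocated before a failure remain in `buffers`.
cudaError_t allocateWorkBuffers(std::vector<void*>& buffers) {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) return err;

    for (int dev = 0; dev < deviceCount; ++dev) {
        err = cudaSetDevice(dev);
        if (err != cudaSuccess) return err;

        // Report how much device memory is free before allocating.
        size_t freeBytes = 0, totalBytes = 0;
        err = cudaMemGetInfo(&freeBytes, &totalBytes);
        if (err != cudaSuccess) return err;
        std::printf("GPU %d: %zu MiB free of %zu MiB\n",
                    dev, freeBytes >> 20, totalBytes >> 20);

        void* workBuffer = nullptr;
        err = cudaMalloc(&workBuffer, kWorkBufferBytes);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "GPU %d: cudaMalloc failed: %s\n",
                         dev, cudaGetErrorString(err));
            return err;
        }
        buffers.push_back(workBuffer);
    }
    return cudaSuccess;
}
```

With 8 devices, `totalRequestedBytes(8)` comes to 72 GiB, which matches the figure quoted above.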

So this means that in order to run my program, the user is forced to have more than 80GB of free space on their hard drive, despite the fact that the host never actually needs to read or write the 72GB of global memory used by my app; only the devices access this memory.

Is there a way to solve this by telling CUDA that this memory is only used by the devices, so that Windows does not allocate tons of virtual memory / disk space (pagefile.sys) for it?

My first idea, since I use only one block, was to use shared memory, which can be shared by the threads but obviously this cannot work since shared mem is way too small for a 9GB buffer…

I think the only solution left would be to use a big 1D surface of size 9GB. Do you guys think this could work? (That is, would it give roughly the same performance as global memory without causing Windows to need a huge pagefile?)

Thanks for any insights !

Robert_Crovella · July 5, 2018, 3:03pm

You could try doing an in-kernel malloc operation. You would have to substantially increase the size of the device heap.

It’s possible that this will also impact the virtual memory reservation, but my sense is that it might not. You would have to try it to see. It might be that when you do the cudaDeviceSetLimit operation, it would do a virtual reservation similarly to what you are experiencing now.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
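As a rough sketch of the host-side step this suggestion implies: the device heap used by in-kernel malloc() defaults to 8 MB and can be raised with cudaDeviceSetLimit() before any kernel that uses malloc()/free() is launched. The helper name `setDeviceHeapSize` is hypothetical; whether the driver backs the larger heap with a host virtual memory reservation is exactly the open question here, and this code does not answer it.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Raises the in-kernel malloc heap for the current device, then reads the
// limit back so the caller can see what the driver actually applied.
// Call this before launching any kernel that uses malloc()/free().
cudaError_t setDeviceHeapSize(size_t requestedBytes, size_t* actualBytes) {
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, requestedBytes);
    if (err != cudaSuccess) return err;
    return cudaDeviceGetLimit(actualBytes, cudaLimitMallocHeapSize);
}
```

A caller would pass something slightly larger than the intended work buffer size and compare `*actualBytes` against the request before launching the kernel.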

The backing store for a surface would have to use one of the members of the cudaMalloc family (AFAIK), so I’m not sure how/why a surface allocation would be any different than an ordinary flat allocation.

cryptogone.dev · July 5, 2018, 6:33pm

Thanks a lot for the quick answer.

So I tried the in-kernel malloc operation on a GTX960 with 4GB of memory using the technique described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#per-thread-block-allocation

I noticed two things:

I was able to malloc() a big chunk of memory in kernel and share it between threads, however it seems that I can only use up to ~50% of the device’s memory (2GB) before getting a “too many resources requested for launch” CUDA exception.
The virtual memory is also reserved when using this technique :-( so Windows will still show more reserved memory in task manager / eat up hard drive space for pagefile.
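
For readers who want to reproduce the experiment, a minimal sketch of the per-thread-block allocation pattern from the programming guide follows. It is not the code used in this test; the names `perBlockAlloc` and `runPerBlockAlloc` are hypothetical, and the buffer is only filled with a dummy pattern. For large sizes the device heap has to be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) first.

```cuda
#include <cstddef>
#include <cstdlib>
#include <cuda_runtime.h>

// Thread 0 allocates from the device heap and publishes the pointer to the
// rest of the block through shared memory.
__global__ void perBlockAlloc(size_t bytes, int* ok) {
    __shared__ unsigned char* workBuffer;

    if (threadIdx.x == 0) {
        workBuffer = static_cast<unsigned char*>(malloc(bytes));
    }
    __syncthreads();

    // The pointer is the same for every thread, so this branch is uniform.
    if (workBuffer == nullptr) {
        if (threadIdx.x == 0) *ok = 0;
        return;
    }

    // Each thread writes a strided slice of the shared buffer.
    for (size_t i = threadIdx.x; i < bytes; i += blockDim.x) {
        workBuffer[i] = static_cast<unsigned char>(i);
    }
    __syncthreads();

    // Only one thread frees the buffer, after everyone is done with it.
    if (threadIdx.x == 0) {
        *ok = 1;
        free(workBuffer);
    }
}

// Returns 1 if the in-kernel malloc succeeded, 0 if it returned NULL,
// and -1 if a CUDA runtime call or the launch itself failed.
int runPerBlockAlloc(size_t bytes) {
    int* dOk = nullptr;
    if (cudaMalloc(&dOk, sizeof(int)) != cudaSuccess) return -1;

    perBlockAlloc<<<1, 32>>>(bytes, dOk);  // 1 block, 32 threads, as in the original post

    int hOk = -1;
    // cudaMemcpy waits for the kernel and surfaces launch/execution errors.
    cudaError_t err = cudaMemcpy(&hOk, dOk, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOk);
    return err == cudaSuccess ? hOk : -1;
}
```

The wrapper separates an allocation that simply returned NULL (0) from a failed launch (-1), which is how an error like the one quoted above would show up.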

I will eventually try the same on the 1080 Tis when I have access to them, but I guess I will get the same kind of result, so this is probably a dead end.

But at least I learned some new things trying this ;-)

Thanks again !

spudz76 · July 31, 2018, 2:25pm

Run Linux, it doesn’t need any of that BS with swap-backing-ram

Really a windows bug.
