Multi GPU, Windows 10 pagefile and global memory issues

Hello,

Let’s consider a system with 8x 1080 Ti GPUs running Windows 10.

For each GPU I allocate a big buffer of 9GB via cudaMalloc() (so global memory), let’s call it ‘workBuffer’

Each GPU then runs a kernel (1 block, 32 threads) that reads from and writes to workBuffer and produces 2048 bytes of output, which the CPU then processes.

The CPU actually never needs to read or write workBuffer, only GPU does.

So given 8x 1080ti, this uses a total of 8 * 9 = 72GB of global memory.

On Windows 10, I am forced to increase the virtual memory size to ~80GB otherwise I will get an out of memory exception when calling cudaMalloc().
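
For reference, the allocation pattern described above might look roughly like the following. This is only a minimal sketch under assumptions: the names kWorkBufferBytes, totalRequestedBytes and workBuffers are illustrative, and "9GB" is taken to mean 9 GiB. It allocates one buffer per device and reports the first cudaMalloc() failure, which is where the out of memory error shows up when the pagefile is too small.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Assumption: "9GB" in the post is treated as 9 GiB per device.
constexpr size_t kWorkBufferBytes = 9ULL << 30;

// Total device memory requested across all GPUs, in bytes.
size_t totalRequestedBytes(int deviceCount, size_t bytesPerDevice) {
    return static_cast<size_t>(deviceCount) * bytesPerDevice;
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    std::printf("Requesting %zu bytes in total on %d device(s)\n",
                totalRequestedBytes(deviceCount, kWorkBufferBytes), deviceCount);

    // One device-only buffer per GPU; the host never dereferences these pointers.
    std::vector<void*> workBuffers(deviceCount, nullptr);
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        err = cudaMalloc(&workBuffers[dev], kWorkBufferBytes);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaMalloc on device %d: %s\n",
                         dev, cudaGetErrorString(err));
            break;
        }
    }

    // Release whatever was successfully allocated.
    for (int dev = 0; dev < deviceCount; ++dev) {
        if (workBuffers[dev] != nullptr) {
            cudaSetDevice(dev);
            cudaFree(workBuffers[dev]);
        }
    }
    return err == cudaSuccess ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

With 8 devices this requests 8 * 9 GiB, which matches the 72GB figure above.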

So this means that in order to run my program, the user is forced to have more than 80GB of free space on their hard drive, despite the fact that the host never needs to read or write the 72GB of global memory used by my app; only the devices access this memory.

Is there a way to solve this, i.e. to tell CUDA that this memory is only used by the devices, so that Windows does not allocate tons of virtual memory / disk space (pagefile.sys) for it?

My first idea, since I use only one block, was to use shared memory, which can be shared by the threads but obviously this cannot work since shared mem is way too small for a 9GB buffer…

I think the only solution left would be to use a big 1D surface of size 9GB. Do you guys think this could work? (That is, giving roughly the same performance as global memory without causing Windows to need a huge pagefile.)

Thanks for any insights !

You could try doing an in-kernel malloc operation. You would have to substantially increase the size of the device heap.

It’s possible that this will also impact the virtual memory reservation, but my sense is that it might not. You would have to try it to see. It might be that when you do the cudaDeviceSetLimit operation, it would do a virtual reservation similarly to what you are experiencing now.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
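
A rough sketch of what that suggestion could look like, following the per-thread-block allocation pattern from the programming guide. Everything named here (blockAllocKernel, heapBytesFor, kHeapSlackBytes, the 128 MiB test size and the 64 MiB slack) is an assumption for illustration. It only shows the mechanism; whether it changes the virtual memory reservation on Windows has to be measured.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical slack added on top of the buffer size for allocator overhead.
constexpr size_t kHeapSlackBytes = 64ULL << 20;

// Device heap size to request for one in-kernel allocation of bufferBytes.
size_t heapBytesFor(size_t bufferBytes) {
    return bufferBytes + kHeapSlackBytes;
}

// One block: thread 0 allocates from the device heap, all threads use the
// buffer, thread 0 frees it. *ok is set to 1 only if malloc() succeeded.
__global__ void blockAllocKernel(size_t bufferBytes, int* ok) {
    __shared__ unsigned char* workBuffer;
    if (threadIdx.x == 0) {
        workBuffer = static_cast<unsigned char*>(malloc(bufferBytes));
        *ok = (workBuffer != nullptr);
    }
    __syncthreads();
    if (workBuffer == nullptr) return;  // same value for every thread in the block

    // Each thread touches a strided slice of the shared allocation.
    for (size_t i = threadIdx.x; i < bufferBytes; i += blockDim.x)
        workBuffer[i] = static_cast<unsigned char>(i);
    __syncthreads();

    if (threadIdx.x == 0) free(workBuffer);
}

int main() {
    const size_t bufferBytes = 128ULL << 20;  // small test size, not the full 9GB

    // The heap limit has to be set before the first kernel that uses malloc().
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                         heapBytesFor(bufferBytes));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    int* ok = nullptr;
    int hostOk = 0;
    cudaMalloc(&ok, sizeof(int));
    cudaMemset(ok, 0, sizeof(int));

    blockAllocKernel<<<1, 32>>>(bufferBytes, ok);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        cudaFree(ok);
        return EXIT_FAILURE;
    }

    cudaMemcpy(&hostOk, ok, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("in-kernel malloc %s\n", hostOk ? "succeeded" : "failed");
    cudaFree(ok);
    return hostOk ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Watching the committed memory in Task Manager right after the cudaDeviceSetLimit() call would show whether the reservation happens at that point.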

The backing store for a surface would have to use one of the members of the cudaMalloc family (AFAIK), so I’m not sure how/why a surface allocation would be any different than an ordinary flat allocation.

Thanks a lot for the quick answer.

So I tried the in-kernel malloc operation on a GTX960 with 4GB of memory using the technique described here :
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#per-thread-block-allocation

I noticed two things:

  1. I was able to malloc() a big chunk of memory in kernel and share it between threads, however it seems that I can only use up to ~50% of the device’s memory (2GB) before getting a “too many resources requested for launch” CUDA exception.

  2. The virtual memory is also reserved when using this technique :-( so Windows will still show more reserved memory in task manager / eat up hard drive space for pagefile.
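
One way to put numbers on the two observations above is to query the device directly. A small sketch, assuming the default device and a hypothetical helper toGiB; it prints the free and total device memory and the current in-kernel malloc heap limit, so the requested heap size can be compared against what the runtime reports.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: convert a byte count to GiB for printing.
double toGiB(size_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(1ULL << 30);
}

int main() {
    size_t freeBytes = 0, totalBytes = 0, heapBytes = 0;

    // Free and total global memory on the current device.
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Current size limit of the heap used by in-kernel malloc().
    err = cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    std::printf("free: %.2f GiB, total: %.2f GiB, malloc heap limit: %.2f GiB\n",
                toGiB(freeBytes), toGiB(totalBytes), toGiB(heapBytes));
    return EXIT_SUCCESS;
}
```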

I will eventually try the same on the 1080 Tis when I have access to them, but I guess I will get the same kind of result… so this is probably a dead end.

But at least I learned some new things trying this ;-)

Thanks again !

Run Linux, it doesn’t need any of that BS with swap-backing-ram

Really a windows bug.