Hello,
Let’s consider a system with 8x GTX 1080 Ti GPUs running Windows 10.
For each GPU I allocate a big 9 GB buffer via cudaMalloc() (so global memory); let’s call it ‘workBuffer’.
Each GPU then runs a kernel (1 block, 32 threads) that reads/writes workBuffer and produces 2048 bytes of output, which the CPU then processes.
The CPU never needs to read or write workBuffer; only the GPU does.
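For reference, here is a simplified sketch of the per-GPU setup (the kernel body and names are illustrative, and the real app drives the 8 GPUs concurrently rather than in a serial loop):

```
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: 32 threads read/write workBuffer and
// produce 2048 bytes of results in out.
__global__ void processKernel(unsigned char *workBuffer, unsigned char *out)
{
    // ... actual work elided ...
}

int main()
{
    const size_t workSize = 9ULL << 30; // 9 GB per GPU
    const int numGpus = 8;

    for (int dev = 0; dev < numGpus; ++dev) {
        cudaSetDevice(dev);

        unsigned char *workBuffer = nullptr, *devOut = nullptr;
        // This is the call that fails unless the pagefile is ~80 GB.
        cudaError_t err = cudaMalloc(&workBuffer, workSize);
        if (err != cudaSuccess) {
            printf("GPU %d: cudaMalloc failed: %s\n", dev, cudaGetErrorString(err));
            return 1;
        }
        cudaMalloc(&devOut, 2048);

        processKernel<<<1, 32>>>(workBuffer, devOut);

        unsigned char hostOut[2048];
        cudaMemcpy(hostOut, devOut, sizeof(hostOut), cudaMemcpyDeviceToHost);
        // ... CPU processes the 2048 bytes in hostOut ...
    }
    return 0;
}
```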
So given 8x 1080 Ti, this uses a total of 8 × 9 = 72 GB of global memory.
On Windows 10 I am forced to increase the virtual memory size to ~80 GB, otherwise cudaMalloc() fails with an out-of-memory error.
This means that to run my program, the user must have more than 80 GB of free space on their hard drive, even though the host never reads or writes the 72 GB of global memory used by my app; only the devices ever touch it.
Is there a way to solve this, i.e. to tell CUDA that this memory is only used by the devices, so that Windows does not reserve tons of virtual memory / disk space (pagefile.sys) for it?
My first idea, since I use only one block, was shared memory, which all threads of the block can access, but obviously that cannot work: shared memory is far too small for a 9 GB buffer.
I think the only solution left would be a big 1D surface of 9 GB. Do you guys think this could work, i.e. give roughly the same performance as global memory without forcing Windows to create a huge pagefile?
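For what it’s worth, before going down that road I would sanity-check the device limit; a minimal sketch (assuming 4-byte surface elements, which is my own choice):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Maximum width, in elements, of a 1D surface on device 0.
    int maxSurf1D = 0;
    cudaDeviceGetAttribute(&maxSurf1D, cudaDevAttrMaxSurface1DWidth, 0);

    // A 9 GB buffer of 4-byte elements needs ~2.4 billion elements.
    const size_t needed = (9ULL << 30) / 4;
    printf("max 1D surface width: %d elements, needed: %zu -> %s\n",
           maxSurf1D, needed,
           (size_t)maxSurf1D >= needed ? "fits" : "does not fit");
    return 0;
}
```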
Thanks for any insights!