Unified Memory Limits?

To my delight, I recently discovered that UMA finally has reasonable performance on Windows if you have a PCIe 4.0 system. I don’t have one, but my project sponsor does.

I have a notebook with a GTX 1660 Ti with 6GB of memory, and the Windows 10 host has 16GB of memory and an 8GB page file. The largest UMA array I can allocate is about 5GB. Another system has two GTX 1080 Tis with 11GB of memory each, and the Windows 10 host has 32GB of memory with a 32GB page file. On this system I can allocate a UMA array of 15GB. Both systems use CUDA version 11.

I am confused about (1) why I can’t oversubscribe on the 1660, and (2) why I can oversubscribe on the 1080, but only to 15GB.

Does anybody know what factor(s) limit the maximum UMA allocation size?
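
For reference, this is roughly the kind of probe I mean (a minimal sketch, not my actual test code; the 1GB step size is arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

// Grow the requested size until cudaMallocManaged() refuses,
// to find the largest managed allocation the driver will give us.
int main(void)
{
    const size_t oneGB = 1ULL << 30;
    size_t lastGood = 0;
    void *p = nullptr;

    for (size_t bytes = oneGB; ; bytes += oneGB) {
        if (cudaMallocManaged(&p, bytes) != cudaSuccess) {
            cudaGetLastError();   // clear the allocation error
            break;
        }
        lastGood = bytes;
        cudaFree(p);
    }

    printf("Largest managed allocation: %zu GB\n", lastGood / oneGB);
    return 0;
}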

Update: If I hide one of the 1080 Tis, it becomes clear that oversubscription is NOT working on that system either. It appears that while the subset of UMA functionality available on Windows is faster with PCIe 4.0, you still can’t use oversubscription because they still haven’t implemented page faulting.
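
One quick sanity check (a minimal sketch using standard CUDA runtime attributes, not code from my project) is to ask the driver whether it supports concurrent managed access, which is the capability behind page faulting and oversubscription:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    int managed = 0, concurrent = 0;

    // Does the device support managed (unified) memory at all?
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
    // Can the GPU page-fault on managed memory (needed for oversubscription)?
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);

    printf("Managed memory supported:   %d\n", managed);
    printf("Concurrent managed access:  %d\n", concurrent);
    // On Windows the second attribute reports 0, which is consistent
    // with the missing page faulting described above.
    return 0;
}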

Are you sure about this? Can you check your Task Manager for shared memory, or are you mixing it up with GPU memory?
I can allocate a 5.6 GB array in GPU memory under either Windows or WSL.
Unified Memory doesn’t work under Windows: the code compiles, but it doesn’t allocate anything in shared memory. Under WSL it works; I can allocate 5.6 GB of GPU memory and only 1.6 GB of shared memory, nothing more than 1.6 GB.

By the way, did you notice some load distribution between the CPU and the GPU? With Unified Memory the CPU now works more, the multithreading is amazing, and there is an overall performance gain.

Thanks for your comment. As you have observed, UMA is pretty useless on Windows. And using WSL doesn’t help either, because eventually you end up in the Windows driver, not the Linux driver.

Since I first posted my question, I was able to create an Ubuntu system on a portable drive. I was then able to run Linux on the machine at my customer site where they have a PCIe 4.0 system with an RTX 3xxx. Then we upgraded the CPU memory from 16GB to 32GB, and are now able to run some particle analysis simulations of around 50,000,000 to 100,000,000 particles. (There is no way this would ever fit in the 12GB of GPU memory.)

Currently we’re fine-tuning this to figure out which pieces of particle data we can put in UMA without causing thrashing.
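
In case it is useful to anyone else, the knobs we are experimenting with are cudaMemAdvise and cudaMemPrefetchAsync. A minimal sketch, with made-up array names purely for illustration (both arrays are assumed to come from cudaMallocManaged):

#include <cuda_runtime.h>

// Illustrative placement policy for two managed arrays on a Linux system
// that supports concurrent managed access (like the PCIe 4.0 box above).
void placeParticleData(float *positions, float *history, size_t n, int dev)
{
    size_t bytes = n * sizeof(float);

    // Hot data that every kernel touches: prefetch it to the GPU up front
    // instead of letting it fault in one page at a time.
    cudaMemPrefetchAsync(positions, bytes, dev, 0);

    // Cold data the host reads occasionally: prefer to keep it in system
    // memory and let the GPU access it over the bus, rather than having
    // it migrate back and forth (which is what causes the thrashing).
    cudaMemAdvise(history, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(history, bytes, cudaMemAdviseSetAccessedBy, dev);
}

Which arrays belong in the “hot” bucket and which in the “cold” one is exactly what we are still measuring.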

The load distribution you mentioned is that CUDA creates one process per core when you use UMA, and the first process uses 100% of one core. I presume that each process is monitoring for data modifications, but this is a really strange implementation. They should be arrested by the Computer Science Police.

Thank you for your comment, I really appreciated it. There is almost nowhere else on the net where we can find and share knowledge about CUDA.
I used chapter 2 of
Jaegeun Han, Bharatkumar Sharma - Learn CUDA Programming: A beginner’s guide to GPU programming and parallel computing with CUDA 10.x and C/C++ (2019, Packt Publishing)

And then from deviceQuery under Windows and WSL, the difference is:
Device supports Managed Memory: Yes
Both of them support UVA.

I need to understand your first point better:
" UMA is pretty useless on Windows. And using WSL doesn’t help either because eventually you end up in the Windows driver, not the LInux driver. "

When I compile unifiedmemory.cu from the above-mentioned book, with a small modification to push N to the highest possible limit, I really see a difference.

In the first case, nothing happens. In the second case, under Windows, I see GPU load and GPU memory usage. In the third case, under WSL, GPU load and memory usage are lower; in Task Manager (Windows) the CPU is working more and the GPU behaves differently.

As you can see, in the first part the GPU memory usage is 1.6 GB, while in the second (last) part 1.6 GB of shared memory is used, not GPU memory. But it is limited: I cannot go beyond 1.6 GB of shared memory. So UMP is working, but limited.

It is interesting that Unified Memory is faster: as you can see, it takes longer in GPU memory, while in the second case it is faster, just a blast, in shared memory.

Here is the example I used, slightly modified from the original:

#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 153500000; //1<<27; //29 max
    float *x, *y;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch kernel on N elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "for N = " << N << " Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

my only modification is the line

int N = 153500000; //1<<27; //29 max

to see things clearly.

Also, from the CUDA samples, UnifiedMemoryPerf behaves differently in terms of speed here under WSL,


while under Windows the first 3 columns are completely different.

Thanks, really. I was happy to see that someone noticed UMP.

By “useless under Windows” I meant that you can’t oversubscribe your memory allocations. That means that if the customer’s data won’t quite fit into the GPU memory, you must either upgrade to a better (and much more expensive) GPU, or spend a huge amount of time modifying your software to somehow make it fit.

I solved our particular problem using the existing RTX 3060 GPU and by adding $150 of memory to the CPU. For our particular application, this increased the number of particles we could work with from about 25,000,000 to 100,000,000. The only software change was to enable the use of UMA for certain vectors, and I already had that code from a few years ago.

WSL is very cool and I have been using it a lot recently. It is wonderful being able to build and test the Linux version of our app without having to explicitly start a Linux VM or move to a Linux machine.

However, WSL itself is a virtual machine. Yes, you can run CUDA apps there, but the CUDA calls in this environment go to a special WSL CUDA driver, which in turn calls the Windows CUDA driver, not the native Linux driver (so no real UMA).

If you were to run a test that oversubscribes memory (which you can’t do on Windows), you would see that you have a performance penalty when data is migrated (automatically) between the CPU and GPU because, of course, there is some overhead in the copy. But as I think I commented in my original post, if you have PCIe 4.0, this is no longer horrible.
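
A minimal sketch of what I mean by such a test (the 1.5x factor and the sizing via cudaMemGetInfo are just for illustration, not the benchmark I actually ran):

#include <cstdio>
#include <cuda_runtime.h>

// Touch every element so the managed pages really have to migrate to the GPU.
__global__ void touch(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        data[i] += 1.0f;
}

int main(void)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    // Deliberately ask for 1.5x the total device memory. On Linux this
    // oversubscribes and pages migrate on demand; on Windows you hit the
    // allocation limit discussed earlier in this thread instead.
    size_t n = (size_t)(1.5 * (double)totalB) / sizeof(float);
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        printf("allocation of %zu MB failed\n", (n * sizeof(float)) >> 20);
        return 1;
    }

    for (size_t i = 0; i < n; i++)   // first touch on the host
        data[i] = 0.0f;

    touch<<<1024, 256>>>(data, n);   // forces migration to the GPU
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(data);
    return 0;
}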

Thank you. WSL is a virtual machine, a nice virtual machine.
