GPU swapping

Is there any swapping mechanism for the GPU, or will the process definitely be killed when it reaches the memory limit? If there isn’t, can a user enable such a feature? If it is enabled, can a non-root user disable it?

A Pascal or Volta GPU running on a Linux OS can have its memory “oversubscribed”. In that case, the GPU runtime will swap pages of memory as needed between host and device. To take advantage of this, the memory must be allocated with a managed allocator, such as cudaMallocManaged.
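A minimal sketch of what that looks like in code, assuming a cc6.x+ (Pascal/Volta) GPU on Linux; the 16 GiB size is a hypothetical value chosen to exceed a typical device's physical memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;  // touching a page triggers on-demand migration
}

int main() {
    // Deliberately larger than the device's physical memory (assumption: 16 GiB
    // exceeds it). With cudaMallocManaged on cc6.x+ under Linux this can still
    // succeed; pages migrate between host and device as the kernel touches them.
    size_t n = 16ULL << 30;
    char *p = nullptr;
    if (cudaMallocManaged(&p, n) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    touch<<<(unsigned)((n + 255) / 256), 256>>>(p, n);
    cudaDeviceSynchronize();
    cudaFree(p);
    return 0;
}
```

On a pre-Pascal device the same allocation would instead be limited by available device memory, since there is no demand paging to fall back on.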

Does that mean that for Maxwell and older GPUs, the process will simply be killed?

Moreover, what about binary files running on the GPU? Assume we only have an executable. How can we find out whether swapping will be used on Pascal and Volta?

The process won’t be killed (at least, not by the CUDA runtime). GPU memory is allocated using a function like cudaMalloc.

If you request more than what is available, cudaMalloc will return an error. Beyond that, the application/process behavior is a function of what that application does with that error.
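To illustrate, a sketch of that error path; the 1 TiB request is a hypothetical size chosen to be larger than any current device's memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t n = 1ULL << 40;  // 1 TiB: assumed to exceed available device memory
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, n);
    if (err != cudaSuccess) {
        // cudaMalloc does not kill the process; it just reports the failure.
        // What happens next is entirely up to the application.
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(p);
    return 0;
}
```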

You can profile an application to determine whether or not swapping occurred during execution. Read the profiler manual.

For the counters in compute 5.x described in the nvprof manual, I don’t see anything about swapping. Maybe the name is not exactly “swap”. Do you know?

Compute 5.x doesn’t support demand paging.

I mentioned in my first comment that it had to be a pascal or volta GPU (6.x or 7.x).

Maxwell (5.x) does not support swapping/paging.

Both nvprof and the Visual Profiler can display data about page faults.

Please read the profiler manual, paying attention to page faults.
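As a sketch, one hedged way to surface this with nvprof (flag name per the nvprof manual; `./my_app` is a placeholder for your executable):

```shell
# Unified-memory profiling is on by default on supported (cc6.x+) devices,
# but can be requested explicitly:
nvprof --unified-memory-profiling per-process-device ./my_app
# On a Pascal/Volta GPU the summary includes a "Unified Memory profiling
# result" section reporting page-fault activity. On a cc5.x device no such
# data appears, since the hardware does not demand-page.
```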

I don’t see any metrics for this, but I didn’t look carefully.

On an M2000 (5.x), the program uses less than 4 GB of memory. When I run it with nvprof, I notice that the memory usage increases, which means the events I selected use memory. That is fine, though.
As a test, I enabled many events. The program hasn’t been killed yet.

Prior to that, and without nvprof, I increased the problem size, and I am sure the memory usage should be more than 4 GB. However, the program wasn’t killed either, and it made almost no progress. More precisely, the progress was really slow.

So, I think the M2000 uses swap or paging or something else in order not to kill the program.

You said (and the manual says [1]) that Maxwell has no events related to paging or swapping. How that can be justified?


It’s OK if you disagree with me or don’t believe me.

I can’t explain the behavior of a program you haven’t shown. If you want to claim that M2000 employs paging, you’re welcome to believe that. I wouldn’t claim that.

I don’t know what it means to say “How that can be justified?”

Are you asking “Why are there no metrics related to demand-paging?”

If so, I don’t know the reason why on cc6.x or cc7.x. But on cc5.x I wouldn’t expect to see any metrics related to demand paging, because a cc5.x device does not support demand paging. Of course if you disagree, you’re welcome to your opinion, but I wouldn’t be able to respond to anything based on that.

From the top of my memory, Maxwell becomes much slower when almost the entire memory is used and you access memory pages in a random fashion; the reason is probably the limited size of the TLB cache. So if you are wrong and your program is using a little less than 4 GB of memory, that may be the reason for the slowdown.

I am not talking about personal points of view. Let me state the problem another way. Forget about nvprof…

I have a GPU binary which I run on an M2000. When the input size is small, the program runs fine. However, when the input size is large, the memory usage reported by nvidia-smi is at the maximum. The screen sometimes becomes unstable, and window refreshing is also slow at times.

The input I gave the program should require more than 4 GB. If there is no swapping mechanism, the program should be killed. But it is alive! So I interpret that to mean there is a swap/page solution on the M2000. I am not aware of the internals of Maxwell; this is just what I observe.

Any comment?

I agree with you. That means a run should be sized carefully so as not to enter the slowdown.

Try making a very simple program that allocates a little more than 4 GiB, fills the entire array with data, and exits. Maybe you don’t take into account the difference between GB and GiB? :)

For GPU oversubscription, can we limit the host memory used for swapping?

I assume by “for swapping” you mean migration of data between host and device.

You cannot directly limit the host memory usage. The way to limit it is to reduce your allocation size.

If you allocate 100 GB of managed data, that will use potentially 100 GB of host memory.

If by “for swapping” you meant swapping from host memory to disk, that has nothing to do with managed memory or this discussion.

I indeed mean migration of data between host and device. Thanks for your reply.