Unified memory (cudaMallocManaged) unable to oversubscribe GPU memory on sm_60, Telsa P100

zjw518 · June 22, 2017, 2:35pm

My understanding is that trying to allocate more than the 16gb on my P100 should “just work” using cudaMallocManaged on Pascal (sm_60, CUDA 8.0). But I get the API error “out of memory” - am I missing something extra I need to do?

njuffa · June 22, 2017, 3:28pm

What operating system? How much system memory?

zjw518 · June 22, 2017, 3:30pm

Ubuntu 16.04, 256gb system memory. It failed when trying to allocate a 26gb array.

Robert_Crovella · June 22, 2017, 3:51pm

I’ve not had any trouble with this.

njuffa · June 22, 2017, 4:03pm

I am wondering whether some OS configuration setting could be at fault, but wouldn’t know what to look for other than “everything related to memory management”. Is the installation customized or mostly using default settings?

zjw518 · June 22, 2017, 4:18pm

Not customized at all, so far as I know - the clustering people who built the machine set it up, but nothing non-standard that I know of.

I’ve been suspicious of CUDA installation issues since nvprof has been giving me issues wrt seg faults (which I posted about a while back to no avail). I would try reinstalling now but I urgently need to use the device… eventually, though.

njuffa · June 22, 2017, 4:57pm

It could be a busted CUDA installation, especially if you used a package installer instead of the runfile installer method, which has not failed me once and which I would recommend. Note that package installer and runfile installer approaches don’t mix (well), so you would want to stick with either one or the other.

As for the OS, I am totally biased because I have seen so many issue with Ubuntu over the years. A Linux distro produced by people who, ah, think differently. My opinion: “Friends don’t let friends use Ubuntu.”

zjw518 · June 22, 2017, 5:01pm

Well, the cluster folks installed it - part of the reason I want to reinstall, so I can do it myself.

Well, my friends use Ubuntu, and I’m not certain I’ll ever care that much about which Linux distro I’m using :)

tera · June 22, 2017, 7:03pm

As far as I can see, Ubuntu (thanks to their LTS releases) is the only non-paid-for Linux distribution that is both supported by CUDA 8.0 and still supported by it’s vendor.

This has forced me to abandon my favourite distro for Ubuntu on work-related machines (even though I personally would be happy to pay a distributor for the service of providing (security-) updates).

I might try out oversubcribing memory unified memory on CC 6.x devices under Ubuntu 16.04 at some point, but due to distro-related problems I currently have it might take me several days until I get around doing it.

zjw518 · June 22, 2017, 7:07pm

Just re-installed CUDA using the runfile method… absolutely no difference. Can’t oversubscribe, and nvprof still gives me the damn segmentation fault errors while the program alone (and memcheck) work fine.

njuffa · June 22, 2017, 7:35pm

How fast are non-paid-for Linux distros dropping support these days??

I am highly skeptical that the current “versionitis” that seems to have gripped the software industry as a whole is doing anybody any good.

tera · June 22, 2017, 7:51pm

About two years maximum. Which leaves about six months after they start being supported by CUDA.

Note I’m not blaming anyone. You get what you pay for (and a lot more indeed).

zjw518 · June 24, 2017, 4:59pm

I looked into this a bit further, and what’s really going on is that I can’t allocate more memory than the device’s amount to a single array. I can oversubscribe as long as each array is < 16GB. Is this expected behavior? Is there any way around this?

njuffa · June 24, 2017, 8:08pm

I would suggest you wait for an answer from a more knowledgeable person for an authoritative answer, but the limitation you describe seems plausible to me. Since managed memory needs to be mapped into both of the CPU and the GPU address spaces, and the GPU only has 16 GB to map, a single allocation cannot exceed the physical size of the GPU memory (probably a little bit less as CUDA itself needs some memory).

BulatZiganshin · June 24, 2017, 8:15pm

Initially, managed memory moved an entire array to CPU/GPU as required. This automatically limits allocation to VRAM size. I don’t remember whether they already implemented paging in P100 (and which drivers it require?) or we need to wait Volta?

tera · June 24, 2017, 9:31pm

The Programming Guide Section on GPU Memory Oversubscription doesn’t mention such a limit to a single array, and indeed claims CC 6.x devices to have a full 49 bit virtual address space.

Can you post the exact code you use to allocate the array? As txbob has already tested that allocating a 26GB array worked for him, I am somehow suspicious that integer overflow is the issue here, rather than device memory size.

zjw518 · June 24, 2017, 9:35pm

The allocation line is

cudaMallocManaged((void**) &array, 2 * 512 * 512 * 512 * 13 * sizeof(double)) );

which amounts to 2512^313 = 3,489,660,928 doubles = 27,917,287,424 bytes.

I would be very happy to not have to divide these arrays by hand - it wouldn’t be the worst thing ever (that factor of 2 can easily be split), but I would like to avoid differentiating the CPU and GPU implementations of my code any further.

njuffa · June 24, 2017, 9:48pm

Integer overflow, as tera suspected. ‘int’ can only represent values up to 2**31-1 = 2,147,483,647. Try this instead:

sizeof(double) * 2 * 512 * 512 * 512 * 13

which will cause the expression to be evaluated using size_t.

zjw518 · June 24, 2017, 10:16pm

Goodness, here’s something I’ll never mess up again. That’s exactly it, thank you.

While we’re here, if anyone has any experience - are there any “best practices” for dealing with problem sizes that exceed GPU RAM? (In my case, by a factor of ~<4 at the end of the day.) I would assume that letting CUDA take care of page faulting is optimal.

I’m hoping that enough computation can fit onto the device at once to mask the host-device transfers, since I know this is the most costly aspect of GPU programming.

tera · June 24, 2017, 10:28pm

Using managed memory and relying on page faulting certainly isn’t “optimal” - manually moving the right data to the device ahead of it being accessed will always be faster.
But using prefetch and usage hints you may get very close to optimal performance with a lot less effort.
Unfortunately CUDA, unlike CPU operating systems, doesn’t offer a memAdvise hint for sequential access yet. I hope this will come with a future CUDA release.

Topic		Replies	Views
Unified memory oversubscription and page faults CUDA Programming and Performance	7	2811	March 23, 2018
using cudaMalloc and cudaFree within a loop unspecified launch failure! CUDA Programming and Performance	21	37697	April 23, 2009
Pinned memory limit CUDA Programming and Performance	16	13498	May 1, 2016
Unified memory and overprovisioning CUDA Programming and Performance cuda	5	1406	March 6, 2022
Running out of global memory CUDA Programming and Performance	9	2223	December 10, 2021
cudaMalloc error in big loop CUDA Programming and Performance	12	15608	May 21, 2008
cudaMallocManaged allocating more memory than requested CUDA Programming and Performance	7	3207	July 13, 2018
Unified Memory Limits? CUDA on Windows Subsystem for Linux	7	3711	July 6, 2022
cudaMalloc causes segmentation fault 2 Mo is far from my 1,2 Go card memory limit CUDA Programming and Performance	7	7469	June 28, 2011
Splitting large datasets to fit into device memory Algorithm, implementation, and problems CUDA Programming and Performance	1	7420	April 25, 2009

Unified memory (cudaMallocManaged) unable to oversubscribe GPU memory on sm_60, Telsa P100

Related topics