Unified memory (cudaMallocManaged) unable to oversubscribe GPU memory on sm_60, Tesla P100

Thanks, tera.

I’m getting seg faults, and I’m unsure if it’s still an integer overflow issue or a CUDA issue. I have changed my array indices (I’m using index functions) to unsigned long long ints but still get the seg faults when trying to access the second half of the array (the largest indices). Should I be able to use long longs as array indices here? That is, should I expect this to work normally, so that, since it does not, I can deduce the issue is related to CUDA?

Yes, you can use 64-bit integer types as indices; it should only slow down the indexing computations but otherwise be fully functional.
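
For illustration only (this is not code from the thread; the array size and the trivial scaling kernel are assumptions), here is a minimal sketch of a kernel that indexes a managed allocation of more than 2^32 elements using unsigned long long throughout:

// Grid-stride loop over a managed array too large for 32-bit indexing.
// The 6G-element size is made up; on a 16 GB P100 it also oversubscribes
// device memory, which Pascal's demand paging handles (though, as noted
// later in this thread, possibly very slowly).
#include <cuda_runtime.h>

__global__ void scale(float *data, unsigned long long n, float factor)
{
    // Compute the starting index and grid stride in 64 bits so the loop
    // can walk past the ~4.29 billion element limit of a 32-bit index.
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    for (unsigned long long i =
             (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        data[i] *= factor;
    }
}

int main()
{
    const unsigned long long N = 6ULL * 1024 * 1024 * 1024;  // ~24 GB of floats
    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));
    scale<<<1024, 256>>>(data, N, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}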

I would triple-check all the integer computations that feed into the indexing computation. An integer overflow may occur before the data is ever assigned to a 64-bit integer type. It’s a fairly common bug, and one that is especially easy to overlook when some of that computation is hidden inside macros.
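
To make the trap concrete, here is a sketch (the helper and variable names are invented, not taken from the poster's code) of an index function where the multiply overflows in 32-bit int arithmetic before the result is ever widened, together with the fixed variant that widens an operand first:

// BROKEN: row * width is evaluated as int; once row * width exceeds 2^31-1
// it overflows, and only the wrapped value is converted to the 64-bit type.
__host__ __device__ inline unsigned long long
bad_index(int row, int col, int width)
{
    return row * width + col;
}

// FIXED: widening one operand first forces the whole computation into
// 64-bit arithmetic, so no intermediate result can overflow.
__host__ __device__ inline unsigned long long
good_index(int row, int col, int width)
{
    return (unsigned long long)row * width + col;
}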

Thanks for saving the day again, njuffa (and everyone else)! I wasn’t using macros - just inline functions - and setting the argument types to unsigned longs rather than ints did the trick.

Alas, it seems this will be untenably slow. A kernel that executed in 0.235 seconds at 1/8th the resolution took ~30 minutes. LOL.

As tera pointed out, you may be better off by tiling the work manually, and then taking advantage of the fact that uploads to the GPU and downloads from the GPU via DMA can run concurrently with kernel execution. If you build a processing pipeline in this fashion using a double-buffering scheme, you may be able to set up out-of-core processing at close to full performance (= same throughput as for smaller in-core problems).
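
As a rough sketch of such a pipeline (the process_chunk kernel, chunk size, and function names are hypothetical, and the host buffer is assumed to be pinned so the async copies can overlap), two streams ping-pong between two device buffers so that, on a GPU with dual copy engines like the P100, one chunk's kernel can overlap with its neighbors' transfers:

#include <cuda_runtime.h>

// Placeholder per-chunk work; stands in for whatever the real kernel does.
__global__ void process_chunk(float *d, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        d[i] = sqrtf(d[i]);
}

// h_data must be pinned (cudaMallocHost/cudaHostAlloc) for true async overlap.
void run_out_of_core(float *h_data, size_t total, size_t chunk)
{
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    size_t num_chunks = (total + chunk - 1) / chunk;
    for (size_t k = 0; k < num_chunks; ++k) {
        int b = k & 1;                                   // ping-pong buffer/stream
        size_t offset = k * chunk;
        size_t n = (offset + chunk <= total) ? chunk : (total - offset);

        // Copy in, process, copy out -- all queued on this chunk's stream,
        // so the other stream's transfers and kernel can overlap with them.
        cudaMemcpyAsync(d_buf[b], h_data + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process_chunk<<<256, 256, 0, stream[b]>>>(d_buf[b], n);
        cudaMemcpyAsync(h_data + offset, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
}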

This has certainly been done before. Just recently someone posted in these forums pointing to a solver library able to handle huge matrices out-of-core with impressive performance.