Unified memory (cudaMallocManaged) unable to oversubscribe GPU memory on sm_60, Tesla P100

Thanks, tera.

I’m getting seg faults, and I’m unsure if it’s still an integer overflow issue or a CUDA issue. I have changed my array indices (I’m using index functions) to unsigned long long ints but still get the seg faults when trying to access the second half of the array (the largest indices). Should I be able to use long longs as array indices here? That is, should I expect this to work normally, so that, since it does not, I can deduce the issue is related to CUDA?

Yes, you can use 64-bit integer types as indices; it should only slow down the indexing computations but otherwise be fully functional.
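
For illustration only (this is not code from the thread; the array size and the trivial scaling kernel are assumptions), here is a minimal sketch of a kernel that indexes a managed allocation of more than 2^32 elements using unsigned long long throughout:

// Grid-stride loop over a managed array too large for 32-bit indexing.
// The 6G-element size is made up; on a 16 GB P100 it also oversubscribes
// device memory, which Pascal's demand paging handles (though, as noted
// later in this thread, possibly very slowly).
#include <cuda_runtime.h>

__global__ void scale(float *data, unsigned long long n, float factor)
{
    // Compute the starting index and grid stride in 64 bits so the loop
    // can walk past the ~4.29 billion element limit of a 32-bit index.
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    for (unsigned long long i =
             (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        data[i] *= factor;
    }
}

int main()
{
    const unsigned long long N = 6ULL * 1024 * 1024 * 1024;  // ~24 GB of floats
    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));
    scale<<<1024, 256>>>(data, N, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}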

I would triple-check all the integer computations that feed into the indexing computation. An integer overflow may occur before the data is ever assigned to a 64-bit integer type. It’s a fairly common bug, and one that is especially easy to overlook when some of that computation is hidden inside macros.
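
To make the trap concrete, here is a sketch (the helper and variable names are invented, not taken from the poster's code) of an index function where the multiply overflows in 32-bit int arithmetic before the result is ever widened, together with the fixed variant that widens an operand first:

// BROKEN: row * width is evaluated as int; once row * width exceeds 2^31-1
// it overflows, and only the wrapped value is converted to the 64-bit type.
__host__ __device__ inline unsigned long long
bad_index(int row, int col, int width)
{
    return row * width + col;
}

// FIXED: widening one operand first forces the whole computation into
// 64-bit arithmetic, so no intermediate result can overflow.
__host__ __device__ inline unsigned long long
good_index(int row, int col, int width)
{
    return (unsigned long long)row * width + col;
}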

Thanks for saving the day again, njuffa (and everyone else)! I wasn’t using macros - just inline functions - and setting the argument types to unsigned longs rather than ints did the trick.

Alas, it seems this will be untenably slow. A kernel that executed in 0.235 seconds at 1/8th the resolution took ~30 minutes. LOL.

As tera pointed out, you may be better off by tiling the work manually, and then taking advantage of the fact that uploads to the GPU and downloads from the GPU via DMA can run concurrently with kernel execution. If you build a processing pipeline in this fashion using a double-buffering scheme, you may be able to set up out-of-core processing at close to full performance (= same throughput as for smaller in-core problems).
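
As a rough sketch of such a pipeline (the process_chunk kernel, chunk size, and function names are hypothetical, and the host buffer is assumed to be pinned so the async copies can overlap), two streams ping-pong between two device buffers so that, on a GPU with dual copy engines like the P100, one chunk's kernel can overlap with its neighbors' transfers:

#include <cuda_runtime.h>

// Placeholder per-chunk work; stands in for whatever the real kernel does.
__global__ void process_chunk(float *d, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        d[i] = sqrtf(d[i]);
}

// h_data must be pinned (cudaMallocHost/cudaHostAlloc) for true async overlap.
void run_out_of_core(float *h_data, size_t total, size_t chunk)
{
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    size_t num_chunks = (total + chunk - 1) / chunk;
    for (size_t k = 0; k < num_chunks; ++k) {
        int b = k & 1;                                   // ping-pong buffer/stream
        size_t offset = k * chunk;
        size_t n = (offset + chunk <= total) ? chunk : (total - offset);

        // Copy in, process, copy out -- all queued on this chunk's stream,
        // so the other stream's transfers and kernel can overlap with them.
        cudaMemcpyAsync(d_buf[b], h_data + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process_chunk<<<256, 256, 0, stream[b]>>>(d_buf[b], n);
        cudaMemcpyAsync(h_data + offset, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
}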

This has certainly been done before. Just recently someone posted in these forums pointing to a solver library able to handle huge matrices out-of-core with impressive performance.