Unified memory oversubscription and page faults

I am using unified memory to store a big array that fully occupies GPU memory, and I want the array to be stored in GPU memory only (just for experiments; later on I will need to store the array across two GPUs).

First I get the available GPU memory via cudaMemGetInfo(), then I use cudaMallocManaged() to allocate unified memory for the array.

Then I initialize the array by launching a kernel with 640 blocks of 1024 threads each.

The problem is:

  1. The first several executions of the application behave the same. After a few launches, according to nvprof and nvvp, there are lots of GPU page faults, and the page faults that occur in the later part of the execution are due to oversubscription (nvvp reports that the page migration is due to eviction).
  2. There are lots of GPU page faults due to writes from the GPU. Before the kernel launch I tried giving advice with cudaMemAdvise (roughly as in the sketch after this list) and prefetching (even though there is no data to fetch, I was desperate), but neither changed the behavior.
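
For reference, the advise calls I tried looked roughly like the following sketch (not my exact code; the specific advice flags and device 0 are assumptions on my part):

int dev = 0;
// Hint that the pages should preferably live on the GPU and will be accessed by it
cudaMemAdvise(d_q, q_size * sizeof(inttype), cudaMemAdviseSetPreferredLocation, dev);
cudaMemAdvise(d_q, q_size * sizeof(inttype), cudaMemAdviseSetAccessedBy, dev);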

OS: Linux Ubuntu 16.04 64-bit
GPU: GTX 1080
Compiler flag: -arch=sm_61 -o
CUDA & toolkit version: 9.1

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda_runtime_api.h>

#define inttype uint32_t

static const inttype EMPTYVECT32 = 0x7FFFFFFF;

__global__ void init_queue(inttype *d_q, inttype n_elem) {
    inttype nthreads = blockDim.x * gridDim.x;
    inttype i = (blockIdx.x * blockDim.x) + threadIdx.x;

    for(; i < n_elem; i += nthreads) {
    	d_q[i] = (inttype) EMPTYVECT32;
    }
}

int main(int argc, char const *argv[]) {
	inttype *d_q;
	size_t q_size;

	size_t available, total;
	cudaMemGetInfo(&available, &total);
	printf("available: %lu\n", available);

	q_size = available / sizeof(inttype);

	while(cudaMallocManaged((void**)&d_q,  q_size * sizeof(inttype)) == cudaErrorMemoryAllocation);
		
	printf("q_size: %lu = %f GB\n", q_size, (float) q_size / 1024.0 / 1024.0 / 1024.0 * sizeof(inttype));
	init_queue<<<640, 1024>>>(d_q, q_size);
	cudaDeviceSynchronize();

	cudaFree(d_q);

	return 0;
}

This is very strange:

while(cudaMallocManaged((void**)&d_q,  q_size * sizeof(inttype)) == cudaErrorMemoryAllocation);

It is not guaranteed that all of the memory listed as “available” by cudaMemGetInfo can actually be obtained in a single allocation. Therefore there is no guarantee that your allocation will be entirely device-resident.
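
If you want the allocation itself to be checked rather than retried at the same size, one possible pattern is a sketch like the following (the 10% back-off step and the 1 MB floor are arbitrary choices of mine):

size_t bytes = available;
cudaError_t err = cudaMallocManaged((void**)&d_q, bytes);
// If the request is too large, shrink it and retry instead of spinning forever
while (err == cudaErrorMemoryAllocation && bytes > (1 << 20)) {
    bytes -= bytes / 10;
    err = cudaMallocManaged((void**)&d_q, bytes);
}
if (err != cudaSuccess) {
    printf("allocation failed: %s\n", cudaGetErrorString(err));
    return 1;
}
q_size = bytes / sizeof(inttype);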

I’ve changed the strange “while” loop into a plain call:

cudaMallocManaged((void**)&d_q, q_size * sizeof(inttype));

And the results are the same. Even though the allocation may not be entirely device-resident, according to the profiling results from nvvp the page faults (due to writes) almost fill up the “GPU Page Faults” row for the duration of the kernel execution. It seems that most of the pages are not device-resident. Is there a way to fix this?

Apart from the page faults due to writes, I still cannot figure out why there are page faults due to oversubscription. Those page faults start to happen in the later half of the kernel execution. I thought that limiting the array size according to the available memory would avoid oversubscription, but apparently that is not the case. Why is this happening?

Use cudaMemPrefetchAsync.

The cudaMemGetInfo call returns an “available” number that may be higher than what can actually be allocated on the device.

Thanks, I will decrease the size of the array. However, cudaMemPrefetchAsync() makes no difference. The number of page faults is still the same.

cudaMemPrefetchAsync certainly seems to make a difference according to my testing:

$ cat t69.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define inttype uint32_t

static const inttype EMPTYVECT32 = 0x7FFFFFFF;

__global__ void init_queue(inttype *d_q, inttype n_elem) {
    inttype nthreads = blockDim.x * gridDim.x;
    inttype i = (blockIdx.x * blockDim.x) + threadIdx.x;

    for(; i < n_elem; i += nthreads) {
        d_q[i] = (inttype) EMPTYVECT32;
    }
}

int main(int argc, char const *argv[]) {
        inttype *d_q;
        size_t q_size;

        size_t available, total;
        cudaMemGetInfo(&available, &total);
        printf("available: %lu\n", available);

        q_size = available / (4*sizeof(inttype));

        cudaMallocManaged((void**)&d_q,  q_size * sizeof(inttype));
#ifdef USE_PREFETCH
        cudaMemPrefetchAsync(d_q, q_size*sizeof(inttype), 0);
#endif
        printf("q_size: %lu = %f GB\n", q_size, (float) q_size / 1024.0 / 1024.0 / 1024.0 * sizeof(inttype));
        init_queue<<<640, 1024>>>(d_q, q_size);
        cudaDeviceSynchronize();

        cudaFree(d_q);

        return 0;
}
$ nvcc -arch=sm_60 -o t69 t69.cu
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t69
==11329== NVPROF is profiling process 11329, command: ./t69
available: 16726163456
q_size: 1045385216 = 3.894363 GB
==11329== Profiling application: ./t69
==11329== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  590.98ms         1  590.98ms  590.98ms  590.98ms  init_queue(unsigned int*, unsigned int)
      API calls:   50.82%  590.99ms         1  590.99ms  590.99ms  590.99ms  cudaDeviceSynchronize
                   30.78%  357.96ms         1  357.96ms  357.96ms  357.96ms  cudaMemGetInfo
                   16.43%  191.05ms         1  191.05ms  191.05ms  191.05ms  cudaFree
                    1.79%  20.773ms         1  20.773ms  20.773ms  20.773ms  cudaMallocManaged
                    0.10%  1.1447ms        94  12.178us     414ns  483.79us  cuDeviceGetAttribute
                    0.06%  704.27us         1  704.27us  704.27us  704.27us  cuDeviceTotalMem
                    0.01%  123.01us         1  123.01us  123.01us  123.01us  cuDeviceGetName
                    0.01%  101.63us         1  101.63us  101.63us  101.63us  cudaLaunch
                    0.00%  5.9790us         2  2.9890us     266ns  5.7130us  cudaSetupArgument
                    0.00%  4.8860us         3  1.6280us     435ns  2.8640us  cuDeviceGetCount
                    0.00%  2.7650us         1  2.7650us  2.7650us  2.7650us  cudaConfigureCall
                    0.00%  2.5320us         2  1.2660us     541ns  1.9910us  cuDeviceGet

==11329== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    6022         -         -         -           -  746.5049ms  Gpu page fault groups
$ nvcc -arch=sm_60 -o t69 t69.cu -DUSE_PREFETCH
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t69
==11390== NVPROF is profiling process 11390, command: ./t69
available: 16726163456
q_size: 1045385216 = 3.894363 GB
==11390== Profiling application: ./t69
==11390== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  9.3382ms         1  9.3382ms  9.3382ms  9.3382ms  init_queue(unsigned int*, unsigned int)
      API calls:   53.03%  402.06ms         1  402.06ms  402.06ms  402.06ms  cudaMemGetInfo
                   22.85%  173.22ms         1  173.22ms  173.22ms  173.22ms  cudaFree
                   19.89%  150.80ms         1  150.80ms  150.80ms  150.80ms  cudaMemPrefetchAsync
                    2.70%  20.435ms         1  20.435ms  20.435ms  20.435ms  cudaMallocManaged
                    1.23%  9.3560ms         1  9.3560ms  9.3560ms  9.3560ms  cudaDeviceSynchronize
                    0.15%  1.1204ms        94  11.919us     400ns  461.95us  cuDeviceGetAttribute
                    0.09%  706.76us         1  706.76us  706.76us  706.76us  cuDeviceTotalMem
                    0.04%  338.89us         1  338.89us  338.89us  338.89us  cudaLaunch
                    0.02%  135.69us         1  135.69us  135.69us  135.69us  cuDeviceGetName
                    0.00%  13.094us         2  6.5470us     366ns  12.728us  cudaSetupArgument
                    0.00%  6.5940us         3  2.1980us     564ns  3.2730us  cuDeviceGetCount
                    0.00%  3.9600us         1  3.9600us  3.9600us  3.9600us  cudaConfigureCall
                    0.00%  2.6640us         2  1.3320us     821ns  1.8430us  cuDeviceGet
$
  1. Note that I am using CUDA_VISIBLE_DEVICES to limit the CUDA runtime footprint to a single device.
  2. Without -DUSE_PREFETCH, nvprof reports ~6000 GPU page fault groups. With -DUSE_PREFETCH, nvprof does not report any page faults (and the kernel execution time is dramatically faster).

Thank you very much! I noticed that I passed (void**)&d_q rather than d_q as the pointer argument to cudaMemPrefetchAsync().

Now I have no page faults.

Any time you are having trouble with a CUDA code, it's good practice to do proper CUDA error checking. In that scenario, cudaMemPrefetchAsync would return an error code (invalid argument) because you passed an incorrect pointer.
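
A minimal sketch of one common error-checking pattern (the macro name and formatting are my own choice, not taken from the code above):

#include <stdio.h>
#include <stdlib.h>

#define cudaCheck(call)                                              \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Wrapping the prefetch this way reports the bad (void**)&d_q argument
// immediately (as "invalid argument") instead of failing silently:
cudaCheck(cudaMemPrefetchAsync(d_q, q_size * sizeof(inttype), 0));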