Unnecessary HtoD page migration overhead on write when using Unified Memory

Hi,

I am currently tuning the performance of TensorFlow with the memory oversubscription feature provided by Unified Memory on an IBM AC922.
TensorFlow first pre-allocates a huge memory region with cuMemAllocManaged and uses it as a memory pool.
During runtime it then sub-allocates space for tensors from that pool, and returns the space to the pool when a tensor is no longer used.
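(Conceptually, the pooling works roughly like the sketch below. This is only my simplified illustration of such a pool, not TensorFlow's actual allocator; every name in it is made up.)

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical bump-pointer pool carved out of a single managed allocation.
struct ManagedPool {
  char  *base = nullptr;   // start of the managed region
  size_t size = 0;         // total pool size in bytes
  size_t used = 0;         // current bump offset

  cudaError_t init(size_t bytes) {            // one big cudaMallocManaged up front
    size = bytes;
    return cudaMallocManaged(&base, bytes);
  }
  void *alloc(size_t bytes) {                 // sub-allocate space for a tensor
    bytes = (bytes + 255) & ~size_t(255);     // keep 256-byte alignment
    if (used + bytes > size) return nullptr;  // pool exhausted
    void *p = base + used;
    used += bytes;
    return p;
  }
  void release() { cudaFree(base); }          // tear down the whole pool
};

int main() {
  ManagedPool pool;
  pool.init(1ULL << 30);                                                  // e.g. a 1 GB pool
  float *tensor = static_cast<float*>(pool.alloc(1024 * sizeof(float))); // a "new tensor"
  // ... kernels read/write `tensor`; the first device write faults its pages in ...
  pool.release();
  return 0;
}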

After some profiling, I found that a large number of page faults are caused by writing data to newly allocated space (a new tensor).
These page faults are followed by host-to-device (HtoD) page migrations that copy the pages from system memory to GPU memory.
Since we know logically that the allocation is new, the copy is unnecessary; it would be better to simply invalidate the page in system memory and allocate a fresh one in GPU memory at the same virtual address.

Is there any way to avoid these unnecessary copies? I expect it would speed up DNN training considerably.
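(The closest thing to a workaround I can think of is to prefetch a freshly allocated range to the GPU before the first device write, along the lines of the rough sketch below; I am not sure whether the driver still copies the stale page contents in that case.)

#include <cuda_runtime.h>

__global__ void fill(float *t, size_t n) {
  size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
  if (i < n) t[i] = 0.f;
}

int main() {
  const size_t n = 1ULL << 24;               // stand-in for a newly allocated tensor
  float *t;
  cudaMallocManaged(&t, n * sizeof(float));

  int dev = 0;
  cudaGetDevice(&dev);
  // Hint that the range should live on the GPU and move/populate it there
  // before the first device write, instead of relying on fault-driven migration.
  cudaMemAdvise(t, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
  cudaMemPrefetchAsync(t, n * sizeof(float), dev, 0);

  fill<<<(n + 255) / 256, 256>>>(t, n);
  cudaDeviceSynchronize();
  cudaFree(t);
  return 0;
}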

Thanks in advance,
Chi-Chung

In an oversubscription scenario, if the GPU writes to the complete range of a single oversubscribed allocation, it should be evident that some D->H traffic is inevitable.
I can't immediately explain the H->D traffic, but thrashing is a possibility.

In an oversubscribed scenario, if the GPU writes only to a range of memory that fits within GPU memory, then there is no migration of data in the no-host-write case, according to my testing. The test case below allocates 20 GB of managed memory on a 16 GB GPU and has the kernel write either the full range or, when compiled with -DUSE_8, only the first 8 GB; -DHOST_WRITE additionally touches every page on the host before the kernel runs.

$ cat t398.cu
#include <stdio.h>
#include <cuda_profiler_api.h>

__global__ void k(int *d){

  size_t idx = threadIdx.x + (size_t)blockDim.x*blockIdx.x; // promote to 64-bit so the index does not overflow for grids covering >4Gi elements
  d[idx] = 0;
}

int main(){

  size_t ds = 5ULL*1024ULL*1048576ULL; // 5Gi ints -> 20 GB allocation on a 16 GB GPU
  int *d;
  cudaMallocManaged(&d, ds*sizeof(int));
#ifdef HOST_WRITE
  // touch every element on the host first, so all pages start out resident in system memory
  for (size_t i = 0; i < ds; i++) d[i] = 1;
#endif
#ifndef USE_8
  size_t ks = ds; // kernel writes the full 20 GB (oversubscribed) range
#else
  size_t ds_8GB = 2ULL*1024ULL*1048576ULL; // 2Gi ints = 8 GB
  size_t ks = ds_8GB; // kernel writes only the first 8 GB, which fits in GPU memory
#endif
  k<<<ks/256, 256>>>(d);
  cudaDeviceSynchronize();
  cudaProfilerStop();
}
$ nvcc -arch=sm_60 -o t398 t398.cu -DUSE_8
$ nvprof --device-buffer-size 1024 ./t398
==31926== NVPROF is profiling process 31926, command: ./t398
==31926== Profiling application: ./t398
==31926== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.45063s         1  1.45063s  1.45063s  1.45063s  k(int*)
      API calls:   61.86%  1.45069s         1  1.45069s  1.45069s  1.45069s  cudaDeviceSynchronize
                   37.85%  887.77ms         1  887.77ms  887.77ms  887.77ms  cudaMallocManaged
                    0.20%  4.7682ms       384  12.417us     365ns  509.98us  cuDeviceGetAttribute
                    0.06%  1.4154ms         4  353.86us  229.38us  536.43us  cuDeviceTotalMem
                    0.02%  450.02us         4  112.50us  97.579us  136.42us  cuDeviceGetName
                    0.01%  132.51us         1  132.51us  132.51us  132.51us  cudaLaunchKernel
                    0.00%  23.042us         4  5.7600us  4.3950us  8.2050us  cuDeviceGetPCIBusId
                    0.00%  8.4410us         8  1.0550us     443ns  1.8120us  cuDeviceGet
                    0.00%  5.0430us         3  1.6810us     423ns  2.3320us  cuDeviceGetCount
                    0.00%  2.2140us         4     553ns     454ns     786ns  cuDeviceGetUuid

==31926== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   23424         -         -         -           -   1.429526s  Gpu page fault groups
$ nvcc -arch=sm_60 -o t398 t398.cu
$ nvprof --device-buffer-size 1024 ./t398
==31969== NVPROF is profiling process 31969, command: ./t398
==31969== Profiling application: ./t398
==31969== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  4.92223s         1  4.92223s  4.92223s  4.92223s  k(int*)
      API calls:   84.15%  4.92231s         1  4.92231s  4.92231s  4.92231s  cudaDeviceSynchronize
                   15.73%  920.32ms         1  920.32ms  920.32ms  920.32ms  cudaMallocManaged
                    0.08%  4.9341ms       384  12.849us     303ns  523.25us  cuDeviceGetAttribute
                    0.03%  1.4693ms         4  367.32us  227.18us  570.92us  cuDeviceTotalMem
                    0.01%  441.40us         4  110.35us  96.085us  131.23us  cuDeviceGetName
                    0.00%  124.54us         1  124.54us  124.54us  124.54us  cudaLaunchKernel
                    0.00%  22.922us         4  5.7300us  4.4880us  8.1420us  cuDeviceGetPCIBusId
                    0.00%  9.5170us         8  1.1890us     512ns  2.7860us  cuDeviceGet
                    0.00%  6.5430us         3  2.1810us     366ns  3.7320us  cuDeviceGetCount
                    0.00%  2.2490us         4     562ns     443ns     742ns  cuDeviceGetUuid

==31969== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   88761  47.254KB  4.0000KB  988.00KB  4.000000GB  472.8389ms  Host To Device
    2761  2.0000MB  2.0000MB  2.0000MB  5.392578GB  452.4796ms  Device To Host
   58213         -         -         -           -   4.858694s  Gpu page fault groups
$ nvcc -arch=sm_60 -o t398 t398.cu -DHOST_WRITE
$ nvprof --device-buffer-size 1024 ./t398
==32021== NVPROF is profiling process 32021, command: ./t398
==32021== Profiling application: ./t398
==32021== Warning: Found 82 invalid records in the result.
==32021== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==32021== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  6.45874s         1  6.45874s  6.45874s  6.45874s  k(int*)
      API calls:   85.08%  6.45879s         1  6.45879s  6.45879s  6.45879s  cudaDeviceSynchronize
                   14.82%  1.12464s         1  1.12464s  1.12464s  1.12464s  cudaMallocManaged
                    0.07%  5.3009ms       384  13.804us     355ns  582.09us  cuDeviceGetAttribute
                    0.02%  1.6069ms         4  401.72us  266.27us  616.44us  cuDeviceTotalMem
                    0.01%  556.18us         4  139.05us  106.50us  190.73us  cuDeviceGetName
                    0.00%  195.35us         1  195.35us  195.35us  195.35us  cudaLaunchKernel
                    0.00%  23.848us         4  5.9620us  4.2110us  8.1680us  cuDeviceGetPCIBusId
                    0.00%  7.9820us         8     997ns     477ns  1.6850us  cuDeviceGet
                    0.00%  6.3990us         3  2.1330us     332ns  3.9300us  cuDeviceGetCount
                    0.00%  2.6670us         4     666ns     534ns     886ns  cuDeviceGetUuid

==32021== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  436462  47.975KB  4.0000KB  988.00KB  19.96922GB   2.306184s  Host To Device
    2757  2.0000MB  2.0000MB  2.0000MB  5.384766GB  451.1309ms  Device To Host
   57850         -         -         -           -   6.386874s  Gpu page fault groups
Total CPU Page faults: 61440
$

Tesla P100, CUDA 10.0, CentOS 7