Hi,
I am currently tuning the performance of TensorFlow with the memory oversubscription capability provided by Unified Memory on an IBM AC922.
TensorFlow first pre-allocates a huge memory region with cuMemAllocManaged and uses it as a memory pool.
Then, during runtime, it allocates space for tensors out of this pool and returns the space when a tensor is no longer used.
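To make the setup concrete, here is a highly simplified sketch of the pattern I am describing. The ManagedPool type and the bump-pointer sub-allocator are invented purely for illustration (TensorFlow's real allocator is much more elaborate), and the runtime call cudaMallocManaged stands in for the driver-API cuMemAllocManaged:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: one large managed allocation used as a pool,
// with tensor buffers carved out of it by a bump-pointer allocator.
struct ManagedPool {
  char  *base     = nullptr;
  size_t capacity = 0;
  size_t offset   = 0;

  cudaError_t init(size_t bytes) {
    capacity = bytes;
    // Runtime-API counterpart of the cuMemAllocManaged call mentioned above.
    return cudaMallocManaged(&base, bytes);
  }
  // Hand out a 256-byte-aligned sub-range of the pool for one tensor.
  void *alloc(size_t bytes) {
    size_t aligned = (offset + 255) & ~size_t(255);
    if (aligned + bytes > capacity) return nullptr;
    offset = aligned + bytes;
    return base + aligned;
  }
  void destroy() { cudaFree(base); }
};

__global__ void fill(float *t, size_t n, float v) {
  size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
  if (i < n) t[i] = v;  // first GPU write to a freshly carved-out tensor
}

int main() {
  ManagedPool pool;
  if (pool.init(1ULL << 30) != cudaSuccess) { printf("pool allocation failed\n"); return 1; }
  size_t n = 1 << 20;                                  // one 4 MB "tensor"
  float *tensor = static_cast<float *>(pool.alloc(n * sizeof(float)));
  // This first write to the new tensor is where the page faults described below occur.
  fill<<<(n + 255) / 256, 256>>>(tensor, n, 1.0f);
  cudaDeviceSynchronize();
  pool.destroy();
  return 0;
}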
After some profiling, I found that a large number of page faults are caused by writing data into a newly allocated region (a new tensor).
Each of these page faults is followed by a host-to-device (HtoD) page migration that copies the page from system memory to GPU memory.
Since we know the allocation is logically new, the copy is unnecessary; it would be better to invalidate the page in system memory and allocate a fresh one in GPU memory at the same virtual address.
Is there any way to avoid these unnecessary copies? I suspect it would speed up DNN training considerably.
Thanks in advance,
Chi-Chung
In an oversubscription scenario, if the GPU writes to the complete range of a single oversubscribed allocation, it should be evident that some D->H traffic is inevitable.
I can't immediately explain the H->D traffic, but thrashing is a possibility.
In an oversubscribed scenario, if the GPU writes only to a range that fits within GPU memory, then according to my testing there is no data migration in the no-host-write case.
$ cat t398.cu
#include <stdio.h>
#include <cuda_profiler_api.h>

__global__ void k(int *d){
  // 64-bit index so the full 20 GB range is covered without 32-bit overflow
  size_t idx = threadIdx.x + (size_t)blockDim.x*blockIdx.x;
  d[idx] = 0;
}

int main(){
  size_t ds = 5ULL*1024ULL*1048576ULL;      // 5G ints = 20 GB allocation on a 16 GB GPU
  int *d;
  cudaMallocManaged(&d, ds*sizeof(int));
#ifdef HOST_WRITE
  // optionally touch every page on the host first, so pages start out resident in system memory
  for (size_t i = 0; i < ds; i++) d[i] = 1;
#endif
#ifndef USE_8
  size_t ks = ds;                           // kernel writes the full 20 GB range
#else
  size_t ds_8GB = 2ULL*1024ULL*1048576ULL;  // 2G ints = 8 GB, fits in GPU memory
  size_t ks = ds_8GB;                       // kernel writes only the first 8 GB
#endif
  k<<<ks/256, 256>>>(d);
  cudaDeviceSynchronize();
  cudaProfilerStop();
}
$ nvcc -arch=sm_60 -o t398 t398.cu -DUSE_8
$ nvprof --device-buffer-size 1024 ./t398
==31926== NVPROF is profiling process 31926, command: ./t398
==31926== Profiling application: ./t398
==31926== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.45063s 1 1.45063s 1.45063s 1.45063s k(int*)
API calls: 61.86% 1.45069s 1 1.45069s 1.45069s 1.45069s cudaDeviceSynchronize
37.85% 887.77ms 1 887.77ms 887.77ms 887.77ms cudaMallocManaged
0.20% 4.7682ms 384 12.417us 365ns 509.98us cuDeviceGetAttribute
0.06% 1.4154ms 4 353.86us 229.38us 536.43us cuDeviceTotalMem
0.02% 450.02us 4 112.50us 97.579us 136.42us cuDeviceGetName
0.01% 132.51us 1 132.51us 132.51us 132.51us cudaLaunchKernel
0.00% 23.042us 4 5.7600us 4.3950us 8.2050us cuDeviceGetPCIBusId
0.00% 8.4410us 8 1.0550us 443ns 1.8120us cuDeviceGet
0.00% 5.0430us 3 1.6810us 423ns 2.3320us cuDeviceGetCount
0.00% 2.2140us 4 553ns 454ns 786ns cuDeviceGetUuid
==31926== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
23424 - - - - 1.429526s Gpu page fault groups
$ nvcc -arch=sm_60 -o t398 t398.cu
$ nvprof --device-buffer-size 1024 ./t398
==31969== NVPROF is profiling process 31969, command: ./t398
==31969== Profiling application: ./t398
==31969== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 4.92223s 1 4.92223s 4.92223s 4.92223s k(int*)
API calls: 84.15% 4.92231s 1 4.92231s 4.92231s 4.92231s cudaDeviceSynchronize
15.73% 920.32ms 1 920.32ms 920.32ms 920.32ms cudaMallocManaged
0.08% 4.9341ms 384 12.849us 303ns 523.25us cuDeviceGetAttribute
0.03% 1.4693ms 4 367.32us 227.18us 570.92us cuDeviceTotalMem
0.01% 441.40us 4 110.35us 96.085us 131.23us cuDeviceGetName
0.00% 124.54us 1 124.54us 124.54us 124.54us cudaLaunchKernel
0.00% 22.922us 4 5.7300us 4.4880us 8.1420us cuDeviceGetPCIBusId
0.00% 9.5170us 8 1.1890us 512ns 2.7860us cuDeviceGet
0.00% 6.5430us 3 2.1810us 366ns 3.7320us cuDeviceGetCount
0.00% 2.2490us 4 562ns 443ns 742ns cuDeviceGetUuid
==31969== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
88761 47.254KB 4.0000KB 988.00KB 4.000000GB 472.8389ms Host To Device
2761 2.0000MB 2.0000MB 2.0000MB 5.392578GB 452.4796ms Device To Host
58213 - - - - 4.858694s Gpu page fault groups
$ nvcc -arch=sm_60 -o t398 t398.cu -DHOST_WRITE
$ nvprof --device-buffer-size 1024 ./t398
==32021== NVPROF is profiling process 32021, command: ./t398
==32021== Profiling application: ./t398
==32021== Warning: Found 82 invalid records in the result.
==32021== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==32021== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 6.45874s 1 6.45874s 6.45874s 6.45874s k(int*)
API calls: 85.08% 6.45879s 1 6.45879s 6.45879s 6.45879s cudaDeviceSynchronize
14.82% 1.12464s 1 1.12464s 1.12464s 1.12464s cudaMallocManaged
0.07% 5.3009ms 384 13.804us 355ns 582.09us cuDeviceGetAttribute
0.02% 1.6069ms 4 401.72us 266.27us 616.44us cuDeviceTotalMem
0.01% 556.18us 4 139.05us 106.50us 190.73us cuDeviceGetName
0.00% 195.35us 1 195.35us 195.35us 195.35us cudaLaunchKernel
0.00% 23.848us 4 5.9620us 4.2110us 8.1680us cuDeviceGetPCIBusId
0.00% 7.9820us 8 997ns 477ns 1.6850us cuDeviceGet
0.00% 6.3990us 3 2.1330us 332ns 3.9300us cuDeviceGetCount
0.00% 2.6670us 4 666ns 534ns 886ns cuDeviceGetUuid
==32021== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
436462 47.975KB 4.0000KB 988.00KB 19.96922GB 2.306184s Host To Device
2757 2.0000MB 2.0000MB 2.0000MB 5.384766GB 451.1309ms Device To Host
57850 - - - - 6.386874s Gpu page fault groups
Total CPU Page faults: 61440
$
Tesla P100, CUDA 10.0, CentOS 7
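As a possibly relevant aside for the original question (untested here, so treat it as an assumption rather than a verified fix): for a managed range that has never been touched, prefetching it to the GPU with cudaMemPrefetchAsync before the first kernel write lets the pages be populated directly in device memory instead of being migrated fault by fault. Whether this helps in the TensorFlow pool-reuse case, where pages may already be resident in system memory, would have to be measured. A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void init_tensor(float *t, size_t n) {
  size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
  if (i < n) t[i] = 0.0f;
}

int main() {
  int dev = 0;
  cudaSetDevice(dev);
  size_t n = 64ULL * 1048576ULL;                   // 256 MB of floats
  float *buf;
  cudaMallocManaged(&buf, n * sizeof(float));
  // Prefetch the never-touched range to the GPU before the first write, so the
  // pages are created in device memory instead of being migrated on each fault.
  // (Assumption for this sketch: the range has not been populated on the host.)
  cudaMemPrefetchAsync(buf, n * sizeof(float), dev, 0);
  init_tensor<<<(n + 255) / 256, 256>>>(buf, n);
  cudaDeviceSynchronize();
  cudaFree(buf);
  return 0;
}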