Unified virtual memory slowdown even without migration

Can someone explain why using UVM slows down cudaMemcpy, cudaMemset, and cudaLaunchKernel even on hardware that supports virtual memory, and even when there’s no page migration? I know using cudaMemcpy with UVM sounds silly, but I’m adding UVM support to software that has to work across a large number of accelerator back ends, and right now it’s too much work to change the code that does explicit device memory copies. So UVM will only be used to increase available memory, not for easy data sharing.

In the program I’m working on, I’m seeing a ~1.5x slowdown on a GTX 1080 (Linux x86-64) with a single large memory pool of ~6000 MiB. Almost all of the slowdown comes from those three functions.

I’ve put together this test program that benchmarks a 4 MiB cudaMemcpy & cudaMemset to a random location in a buffer. The times are from nvprof.

bench_uvm.cu (926 Bytes)

Findings:

  1. cudaMemset had the worst slowdown, > 10x. cudaMemsetAsync was not nearly as bad.

  2. When the buffer is > 1 GiB, the UVM slowdown gets progressively worse, yet these charts show performance shouldn’t drop until the oversubscription factor reaches ~0.9.

  3. Calling cudaMemAdvise(…, cudaMemAdviseSetPreferredLocation, …) improved speed and reduced the number of GPU page fault groups by 3.9x. There is no migration, so a speedup wasn’t expected. My guess is that forcing all the memory to be allocated up front lets the driver use the largest (2 MiB) page size, resulting in fewer TLB misses. (A sketch of the advise call is just below this list.)
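For reference, here is a minimal sketch of the kind of advise call I mean (not the attached benchmark; the pool size, names, and device id 0 are just placeholders for what my app does):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  const size_t N = size_t(6000) << 20;   // ~6000 MiB pool, as in my app
  unsigned char *pool = nullptr;         // placeholder name
  if (cudaMallocManaged(&pool, N) != cudaSuccess) return 1;
  // Hint to the driver that device 0 is the preferred physical location
  // for this whole range.
  cudaMemAdvise(pool, N, cudaMemAdviseSetPreferredLocation, 0 /* device 0 */);
  printf("advice set on %zu bytes\n", N);
  cudaFree(pool);
  return 0;
}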

Thanks for any help

Which CUDA version are you using? Which driver are you using? Are the timings for the host-side API call, or for the device-side duration of the operation?

Right, good to clarify. I’m using CUDA 10.0.130 and driver 470.86. The times are for the host-side API calls, and I did notice that for (cudaMemset, size=2^32, managed) the host time (116 ms total) is far higher than the device-side time (606 us).

I would suggest prefetching the allocation to the device in question. Here’s my test case:

$ cat t1956.cu
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <iostream>
#include <random>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

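// Return the current time in microseconds, or the elapsed microseconds when
// 'start' is a timestamp obtained from an earlier call.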
unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

using namespace std;

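// 4 MiB per copy/memset; 'host' is an ordinary pageable host source buffer.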
const int BLOCK=1<<22;
uint8_t host[BLOCK];


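// Controlled by environment variables: SIZE (buffer is 2^SIZE bytes),
// MANAGED (use cudaMallocManaged), PREFETCH (prefetch the buffer to device 0).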
int main()
{
  uint8_t* device;
  size_t N=1<<23;
  if(getenv("SIZE"))
    N=1ull<<atol(getenv("SIZE"));
  cout<<"SIZE="<<N<<endl;
  bool use_managed=false;
  if(getenv("MANAGED"))
    use_managed=true;
  if(use_managed)
    cudaMallocManaged(&device,N);
  else
    cudaMalloc(&device,N);

  if(use_managed)
  {
    std::cout << "Managed!" << std::endl;
    if(cudaMemAdvise(device,N,cudaMemAdviseSetPreferredLocation,0)!=cudaSuccess)
      throw 0;
    if(cudaMemAdvise(device,N,cudaMemAdviseSetAccessedBy,0)!=cudaSuccess)
      throw 0;
  }
  if (getenv("PREFETCH")){
    std::cout << "Prefetch!" << std::endl;
    cudaMemPrefetchAsync(device,N,0);
    cudaDeviceSynchronize();}
  unsigned long long dt = dtime_usec(0);
  mt19937_64 random;
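  // 1000 copies and 1000 memsets of BLOCK bytes at random 256-byte-aligned offsets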
  for(int i=0;i<1000;++i)
    cudaMemcpy(&device[(random()%N)&0xffffff00],host,BLOCK,cudaMemcpyHostToDevice);

  for(int i=0;i<1000;++i)
    cudaMemset(&device[(random()%N)&0xffffff00],0,BLOCK);

  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "elapsed: " << dt/(float)USECPSEC  << "s" << std::endl;
  return 0;
}

$ nvcc -arch=sm_70 -o t1956 t1956.cu
$ nvprof ./t1956
SIZE=8388608
==25535== NVPROF is profiling process 25535, command: ./t1956
elapsed: 0.503579s
==25535== Profiling application: ./t1956
==25535== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.21%  403.40ms       487  828.33us  789.20us  1.0076ms  [CUDA memcpy HtoD]
                    0.79%  3.2271ms       503  6.4150us  6.0160us  8.0000us  [CUDA memset]
      API calls:   61.48%  499.17ms      1000  499.17us     841ns  1.2588ms  cudaMemcpy
                   36.97%  300.16ms         1  300.16ms  300.16ms  300.16ms  cudaMalloc
                    0.59%  4.7655ms         4  1.1914ms  589.19us  2.9773ms  cuDeviceTotalMem
                    0.46%  3.7047ms       404  9.1690us     340ns  1.4995ms  cuDeviceGetAttribute
                    0.45%  3.6880ms      1000  3.6880us     682ns  72.958us  cudaMemset
                    0.05%  425.58us         4  106.40us  59.357us  233.85us  cuDeviceGetName
                    0.00%  20.331us         4  5.0820us  2.8800us  8.4180us  cuDeviceGetPCIBusId
                    0.00%  15.837us         1  15.837us  15.837us  15.837us  cudaDeviceSynchronize
                    0.00%  12.776us         8  1.5970us     528ns  6.0670us  cuDeviceGet
                    0.00%  4.1460us         3  1.3820us     847ns  2.1500us  cuDeviceGetCount
                    0.00%  3.0290us         4     757ns     613ns  1.0760us  cuDeviceGetUuid
$ MANAGED=true nvprof ./t1956
SIZE=8388608
==25557== NVPROF is profiling process 25557, command: ./t1956
Managed!
elapsed: 0.997954s
==25557== Profiling application: ./t1956
==25557== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.56%  732.71ms       487  1.5045ms  1.4092ms  2.2901ms  [CUDA memcpy HtoD]
                    0.44%  3.2504ms       503  6.4620us  6.0800us  13.057us  [CUDA memset]
      API calls:   71.58%  974.10ms      1000  974.10us     832ns  3.2177ms  cudaMemcpy
                   26.18%  356.29ms         1  356.29ms  356.29ms  356.29ms  cudaMallocManaged
                    1.69%  22.936ms      1000  22.935us     634ns  114.06us  cudaMemset
                    0.34%  4.5904ms         4  1.1476ms  589.06us  2.8084ms  cuDeviceTotalMem
                    0.18%  2.4274ms       404  6.0080us     327ns  258.38us  cuDeviceGetAttribute
                    0.03%  364.63us         4  91.157us  59.360us  183.41us  cuDeviceGetName
                    0.00%  33.177us         2  16.588us  5.0200us  28.157us  cudaMemAdvise
                    0.00%  27.285us         1  27.285us  27.285us  27.285us  cudaDeviceSynchronize
                    0.00%  20.859us         4  5.2140us  2.9790us  8.0110us  cuDeviceGetPCIBusId
                    0.00%  10.358us         8  1.2940us     416ns  5.0230us  cuDeviceGet
                    0.00%  3.3650us         3  1.1210us     717ns  1.5080us  cuDeviceGetCount
                    0.00%  3.0960us         4     774ns     660ns     973ns  cuDeviceGetUuid

==25557== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       3         -         -         -           -  1.402242ms  Gpu page fault groups
$ MANAGED=true PREFETCH=true nvprof ./t1956
SIZE=8388608
==25603== NVPROF is profiling process 25603, command: ./t1956
Managed!
Prefetch!
elapsed: 0.557461s
==25603== Profiling application: ./t1956
==25603== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.23%  416.67ms       487  855.58us  789.62us  1.0379ms  [CUDA memcpy HtoD]
                    0.77%  3.2196ms       503  6.4000us  6.0480us  7.9680us  [CUDA memset]
      API calls:   60.87%  540.15ms      1000  540.15us     844ns  1.6697ms  cudaMemcpy
                   36.33%  322.35ms         1  322.35ms  322.35ms  322.35ms  cudaMallocManaged
                    1.86%  16.502ms      1000  16.502us     666ns  532.46us  cudaMemset
                    0.57%  5.0246ms         4  1.2561ms  593.70us  3.2310ms  cuDeviceTotalMem
                    0.28%  2.4521ms       404  6.0690us     330ns  273.48us  cuDeviceGetAttribute
                    0.05%  443.01us         4  110.75us  60.267us  254.92us  cuDeviceGetName
                    0.04%  336.01us         1  336.01us  336.01us  336.01us  cudaMemPrefetchAsync
                    0.00%  27.467us         2  13.733us  4.5790us  22.888us  cudaMemAdvise
                    0.00%  20.777us         2  10.388us  8.9780us  11.799us  cudaDeviceSynchronize
                    0.00%  18.457us         4  4.6140us  2.7880us  8.1370us  cuDeviceGetPCIBusId
                    0.00%  13.383us         4  3.3450us     663ns  11.130us  cuDeviceGetUuid
                    0.00%  9.0940us         8  1.1360us     375ns  4.2620us  cuDeviceGet
                    0.00%  2.8090us         3     936ns     544ns  1.4350us  cuDeviceGetCount
$

Yes, I’ve made some changes to your code, but nothing that materially affects the test case(s) you provided, I don’t think. We see that in the non-managed case, the cudaMemcpy operations required a total of 403 ms and the cudaMemset operations a total of 3.2 ms. In the managed case, the cudaMemcpy operations required a total of 732 ms and the cudaMemset operations 3.3 ms. In the managed case with prefetching, the cudaMemcpy operations required 416 ms and the cudaMemset operations 3.2 ms.

I’m not making any comments about oversubscription in this case. There is no comparable scenario for the non-managed case in that setting.

CUDA 11.4, Tesla V100

Thanks for the prefetch tip. That reduced the end-to-end slowdown of the app from 1.43x to 1.11x. I forgot that cudaMallocManaged, just like regular malloc, doesn’t commit any physical memory up front since the allocation is virtual. I noticed that the number of GPU page fault groups dropped to 0, which completely explains why it was slow earlier. I also changed all the cudaMemset and cudaMemcpyHostToDevice calls to the async versions (only cudaMemcpyDeviceToHost needs to be synchronous).
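Roughly the pattern I ended up with, as a sketch rather than the actual app code (the pool size, the 4 MiB staging buffer, and the stream are placeholders):

#include <cuda_runtime.h>

int main()
{
  const size_t N = size_t(6000) << 20;           // pool size placeholder
  unsigned char *pool = nullptr;
  cudaMallocManaged(&pool, N);
  cudaMemAdvise(pool, N, cudaMemAdviseSetPreferredLocation, 0);
  cudaMemPrefetchAsync(pool, N, 0);              // populate pages on device 0 up front
  cudaDeviceSynchronize();

  cudaStream_t s;
  cudaStreamCreate(&s);
  // 4 MiB pageable host buffer; pinning it (cudaMallocHost) would let the
  // async copies actually overlap.
  static unsigned char staging[1 << 22];
  // HtoD copies and memsets go through the async variants on a stream.
  cudaMemcpyAsync(pool, staging, sizeof(staging), cudaMemcpyHostToDevice, s);
  cudaMemsetAsync(pool, 0, sizeof(staging), s);
  // Only DtoH results are read back with a blocking copy before host use.
  cudaMemcpy(staging, pool, sizeof(staging), cudaMemcpyDeviceToHost);

  cudaStreamDestroy(s);
  cudaFree(pool);
  return 0;
}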

Here’s the updated benchmark. There is still a 3 to 10% slowdown for cudaMemcpyAsync and cudaMemsetAsync from using UVM, but I guess that’s tolerable.
CUDA_UVM_benchmark.ods (15.9 KB)

It would be good to know why the synchronous versions take such a big hit from UVM, and why there’s still that small slowdown in the first place even without migration. I noticed cudaMalloc always returns a high address (bits 47:40 are 0x7f), while cudaMallocManaged returns a much lower address. So it seems the high, non-UVM range has paging disabled, leaving no chance of a slowdown from TLB misses. That seems similar to how x86 supports both segmentation and paging, but paging can be enabled or disabled within each segment.
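The address observation came from just printing the two pointers; a minimal sketch of that check, with a made-up 1 MiB size:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  void *plain = nullptr, *managed = nullptr;
  cudaMalloc(&plain, 1 << 20);
  cudaMallocManaged(&managed, 1 << 20);
  // On my system the cudaMalloc pointer lands high in the address space
  // (bits 47:40 == 0x7f) while the managed pointer comes back much lower.
  printf("cudaMalloc        : %p\n", plain);
  printf("cudaMallocManaged : %p\n", managed);
  cudaFree(plain);
  cudaFree(managed);
  return 0;
}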

From my viewpoint, there isn’t a big slowdown for the actual activity itself, and I believe the data I already presented demonstrates that. For the “synchronous versions”, I believe you are looking at the latency of the API call to arrive at that viewpoint. Because those calls are blocking, the CPU thread does not proceed until the operation is complete, and that shows up as latency in the API call. The latency of a blocking API call depends on what other activities are already in flight on the GPU before the requested activity. Understanding that requires careful analysis of the timeline and will change from one application design to another. I haven’t done that analysis, but studying the differences in the issued work patterns would be one way to start answering it, in my view.

Compared to similar non-managed activity, managed memory operations sometimes also require additional steps, some of which are outlined here. I’m not sure any of that is applicable in this case, but there might be some examples.

I’ve measured again on an RTX 2080, and the UVM overhead is almost nothing (< 1%) for cudaMemcpyAsync and cudaMemsetAsync. Very reassuring.