Can someone explain why using UVM slows down cudaMemcpy, cudaMemset, and cudaLaunchKernel even on hardware that supports virtual memory, and even when there’s no page migration? I know using cudaMemcpy with UVM seems pointless, but I’m adding the support to software that has to work across a large number of accelerator back ends, and right now it’s too much work to change the code paths that do explicit device memory copies. So UVM will only be used to increase the available memory, not for easy data sharing.
In the program I’m working on, I’m seeing a ~1.5x slowdown (GTX 1080, Linux x86-64) with a single large memory pool of ~6000 MiB. Almost all of the slowdown comes from those three functions.
I’ve put together this test program that benchmarks a 4 MiB cudaMemcpy & cudaMemset to a random location in a buffer. The times are from nvprof.
bench_uvm.cu (926 Bytes)
Findings:
- cudaMemset had the worst slowdown, > 10x. cudaMemsetAsync was not nearly as bad.
- When the buffer is > 1 GiB, the UVM slowdown keeps getting worse. But these charts show performance doesn’t drop until the oversubscription factor reaches ~0.9.
- Calling cudaMemAdvise(…, cudaMemAdviseSetPreferredLocation, …) improved speed and reduced the number of GPU page fault groups by 3.9x (see the sketch after this list). There is no migration, so a speedup isn’t expected. My guess is that forcing all the memory to be allocated up front makes the driver use the biggest (2 MiB) page size, resulting in fewer TLB misses.
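For reference, here’s roughly the allocation + advice sequence I’m using (a minimal sketch, assuming a single GPU, device 0; the pool size is illustrative and error handling is omitted):

#include <cuda_runtime.h>
#include <cstdint>

int main()
{
  uint8_t *pool = nullptr;
  size_t bytes = 6000ull << 20;   // ~6000 MiB pool, matching the app
  cudaMallocManaged(&pool, bytes);
  // Hint that the pages should live on device 0 so the driver can
  // populate the whole range up front instead of on first-touch faults.
  cudaMemAdvise(pool, bytes, cudaMemAdviseSetPreferredLocation, 0 /*device*/);
  // ... run the cudaMemcpy/cudaMemset benchmark against pool here ...
  cudaFree(pool);
  return 0;
}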
Thanks for any help
Which CUDA version are you using? Which driver are you using? Are the timings for the host-side API call, or for the device-side duration of the operation?
Right, good to clarify. I’m using CUDA 10.0.130 and driver 470.86. The times are for the host-side API calls, and I did notice that for (cudaMemset, size=2^32, managed) the host-side time (116 ms total) is way higher than the device-side time (606 us).
I would suggest prefetching the allocation to the device in question. Here’s my test case:
$ cat t1956.cu
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <iostream>
#include <random>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL
// wall-clock timer in microseconds
unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

using namespace std;

const int BLOCK=1<<22;   // 4 MiB per memcpy/memset
uint8_t host[BLOCK];

int main()
{
  uint8_t* device;
  size_t N=1<<23;        // default buffer size: 8 MiB
  if(getenv("SIZE"))
    N=1ull<<atol(getenv("SIZE"));
  cout<<"SIZE="<<N<<endl;
  bool use_managed=false;
  if(getenv("MANAGED"))
    use_managed=true;
  if(use_managed)
    cudaMallocManaged(&device,N);
  else
    cudaMalloc(&device,N);
  if(use_managed)
  {
    std::cout << "Managed!" << std::endl;
    if(cudaMemAdvise(device,N,cudaMemAdviseSetPreferredLocation,0)!=cudaSuccess)
      throw 0;
    if(cudaMemAdvise(device,N,cudaMemAdviseSetAccessedBy,0)!=cudaSuccess)
      throw 0;
  }
  if (getenv("PREFETCH")){
    std::cout << "Prefetch!" << std::endl;
    cudaMemPrefetchAsync(device,N,0);
    cudaDeviceSynchronize();
  }
  unsigned long long dt = dtime_usec(0);
  mt19937_64 random;
  // copy/set BLOCK bytes at a random 256-byte-aligned offset; the modulus
  // is N-BLOCK so the operation stays inside the buffer
  for(int i=0;i<1000;++i)
    cudaMemcpy(&device[(random()%(N-BLOCK))&0xffffff00],host,BLOCK,cudaMemcpyHostToDevice);
  for(int i=0;i<1000;++i)
    cudaMemset(&device[(random()%(N-BLOCK))&0xffffff00],0,BLOCK);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "elapsed: " << dt/(float)USECPSEC << "s" << std::endl;
  return 0;
}
$ nvcc -arch=sm_70 -o t1956 t1956.cu
$ nvprof ./t1956
SIZE=8388608
==25535== NVPROF is profiling process 25535, command: ./t1956
elapsed: 0.503579s
==25535== Profiling application: ./t1956
==25535== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.21% 403.40ms 487 828.33us 789.20us 1.0076ms [CUDA memcpy HtoD]
0.79% 3.2271ms 503 6.4150us 6.0160us 8.0000us [CUDA memset]
API calls: 61.48% 499.17ms 1000 499.17us 841ns 1.2588ms cudaMemcpy
36.97% 300.16ms 1 300.16ms 300.16ms 300.16ms cudaMalloc
0.59% 4.7655ms 4 1.1914ms 589.19us 2.9773ms cuDeviceTotalMem
0.46% 3.7047ms 404 9.1690us 340ns 1.4995ms cuDeviceGetAttribute
0.45% 3.6880ms 1000 3.6880us 682ns 72.958us cudaMemset
0.05% 425.58us 4 106.40us 59.357us 233.85us cuDeviceGetName
0.00% 20.331us 4 5.0820us 2.8800us 8.4180us cuDeviceGetPCIBusId
0.00% 15.837us 1 15.837us 15.837us 15.837us cudaDeviceSynchronize
0.00% 12.776us 8 1.5970us 528ns 6.0670us cuDeviceGet
0.00% 4.1460us 3 1.3820us 847ns 2.1500us cuDeviceGetCount
0.00% 3.0290us 4 757ns 613ns 1.0760us cuDeviceGetUuid
$ MANAGED=true nvprof ./t1956
SIZE=8388608
==25557== NVPROF is profiling process 25557, command: ./t1956
Managed!
elapsed: 0.997954s
==25557== Profiling application: ./t1956
==25557== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.56% 732.71ms 487 1.5045ms 1.4092ms 2.2901ms [CUDA memcpy HtoD]
0.44% 3.2504ms 503 6.4620us 6.0800us 13.057us [CUDA memset]
API calls: 71.58% 974.10ms 1000 974.10us 832ns 3.2177ms cudaMemcpy
26.18% 356.29ms 1 356.29ms 356.29ms 356.29ms cudaMallocManaged
1.69% 22.936ms 1000 22.935us 634ns 114.06us cudaMemset
0.34% 4.5904ms 4 1.1476ms 589.06us 2.8084ms cuDeviceTotalMem
0.18% 2.4274ms 404 6.0080us 327ns 258.38us cuDeviceGetAttribute
0.03% 364.63us 4 91.157us 59.360us 183.41us cuDeviceGetName
0.00% 33.177us 2 16.588us 5.0200us 28.157us cudaMemAdvise
0.00% 27.285us 1 27.285us 27.285us 27.285us cudaDeviceSynchronize
0.00% 20.859us 4 5.2140us 2.9790us 8.0110us cuDeviceGetPCIBusId
0.00% 10.358us 8 1.2940us 416ns 5.0230us cuDeviceGet
0.00% 3.3650us 3 1.1210us 717ns 1.5080us cuDeviceGetCount
0.00% 3.0960us 4 774ns 660ns 973ns cuDeviceGetUuid
==25557== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
3 - - - - 1.402242ms Gpu page fault groups
$ MANAGED=true PREFETCH=true nvprof ./t1956
SIZE=8388608
==25603== NVPROF is profiling process 25603, command: ./t1956
Managed!
Prefetch!
elapsed: 0.557461s
==25603== Profiling application: ./t1956
==25603== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.23% 416.67ms 487 855.58us 789.62us 1.0379ms [CUDA memcpy HtoD]
0.77% 3.2196ms 503 6.4000us 6.0480us 7.9680us [CUDA memset]
API calls: 60.87% 540.15ms 1000 540.15us 844ns 1.6697ms cudaMemcpy
36.33% 322.35ms 1 322.35ms 322.35ms 322.35ms cudaMallocManaged
1.86% 16.502ms 1000 16.502us 666ns 532.46us cudaMemset
0.57% 5.0246ms 4 1.2561ms 593.70us 3.2310ms cuDeviceTotalMem
0.28% 2.4521ms 404 6.0690us 330ns 273.48us cuDeviceGetAttribute
0.05% 443.01us 4 110.75us 60.267us 254.92us cuDeviceGetName
0.04% 336.01us 1 336.01us 336.01us 336.01us cudaMemPrefetchAsync
0.00% 27.467us 2 13.733us 4.5790us 22.888us cudaMemAdvise
0.00% 20.777us 2 10.388us 8.9780us 11.799us cudaDeviceSynchronize
0.00% 18.457us 4 4.6140us 2.7880us 8.1370us cuDeviceGetPCIBusId
0.00% 13.383us 4 3.3450us 663ns 11.130us cuDeviceGetUuid
0.00% 9.0940us 8 1.1360us 375ns 4.2620us cuDeviceGet
0.00% 2.8090us 3 936ns 544ns 1.4350us cuDeviceGetCount
$
Yes, I’ve made some changes to your code, but nothing that materially affects the test case(s) you provided, I don’t think. We see that in the non-managed case, the cudaMemcpy operations required a total of 403 ms and the cudaMemset operations required a total of 3.2 ms. In the managed case, the cudaMemcpy operations required a total of 732 ms and the cudaMemset operations required a total of 3.3 ms. In the managed case with prefetching, the cudaMemcpy operations required 416 ms and the cudaMemset operations required 3.2 ms.
I’m not making any comments about oversubscription in this case. There is no comparable scenario for the non-managed case in that setting.
CUDA 11.4, Tesla V100
Thanks for the prefetch tip. That reduced the end-to-end slowdown of the app from 1.43x to 1.11x. I forgot that cudaMallocManaged, just like regular malloc, doesn’t commit any physical memory up front since it’s virtual. I noticed that the number of GPU page fault groups dropped to 0, which completely explains why it was slow earlier. I also changed all the cudaMemset calls and all the cudaMemcpyHostToDevice calls to the async versions (only cudaMemcpyDeviceToHost needs to be synchronous), roughly as sketched below.
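Roughly what that conversion looks like (a sketch; the buffer names and stream are placeholders, and the host buffers should be pinned with cudaMallocHost for the async copies to actually overlap):

#include <cuda_runtime.h>
#include <cstdint>

// Sketch: blocking memset/H2D copies become async enqueues on a stream;
// only the D2H path waits before the host reads the result.
void fill_copy_read(uint8_t *dev, const uint8_t *host_src, uint8_t *host_dst,
                    size_t bytes, cudaStream_t stream)
{
  cudaMemsetAsync(dev, 0, bytes, stream);
  cudaMemcpyAsync(dev, host_src, bytes, cudaMemcpyHostToDevice, stream);
  // ... kernels launched on the same stream go here ...
  cudaMemcpyAsync(host_dst, dev, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // host must not touch host_dst before this
}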
Here’s the updated benchmark. There’s still a 3 to 10% slowdown for cudaMemcpyAsync & cudaMemsetAsync from using UVM, but I guess that’s tolerable.
CUDA_UVM_benchmark.ods (15.9 KB)
It would be good to know why the synchronous versions get such a big slowdown from UVM, and why there’s still that small slowdown in the first place even without migration. I noticed cudaMalloc always returns a high address (bits 47:40 are 0x7f), but cudaMallocManaged returns a much lower address. So it seems the high, non-UVM range has paging disabled, so there’s no chance of a slowdown from TLB misses. It seems similar to how x86 supports both segmentation and paging, but paging can be enabled or disabled within each segment.
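A quick way to see the two address ranges (a sketch; the bit pattern is just what I observe on my setup, not documented behavior):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  void *plain = nullptr, *managed = nullptr;
  cudaMalloc(&plain, 1 << 20);
  cudaMallocManaged(&managed, 1 << 20);
  // On my machine the non-managed pointer sits high in the address space
  // (bits 47:40 == 0x7f) while the managed pointer is much lower.
  printf("cudaMalloc:        %p\n", plain);
  printf("cudaMallocManaged: %p\n", managed);
  cudaFree(plain);
  cudaFree(managed);
  return 0;
}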
From my viewpoint, there isn’t a big slowdown for the actual activity itself, and I believe the data I already presented demonstrates that. For the “synchronous versions”, I believe you are looking at the latency of the API call to arrive at that viewpoint. Because these calls are blocking, the CPU thread does not proceed until the operation is complete, and this shows up as latency in the API call. The latency of a blocking API call depends on what other activity is already in flight on the GPU ahead of the requested operation. Pinning that down requires careful analysis of the timeline, and it will change from one application design to another. I haven’t done that analysis, but studying the differences in the issued work patterns would be one place to start.
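To illustrate the distinction, here is a fragment (reusing dtime_usec, device, host, and BLOCK from t1956.cu above; note that cudaMemcpyAsync from pageable host memory can still partially block, so a pinned buffer gives the cleanest enqueue timing):

// Drop-in fragment for t1956.cu: compare host-side latency of a blocking
// copy against the enqueue cost of its async counterpart.
unsigned long long t0 = dtime_usec(0);
cudaMemcpy(device, host, BLOCK, cudaMemcpyHostToDevice);      // blocks until done
unsigned long long t_blocking = dtime_usec(t0);               // device work + prior queue

t0 = dtime_usec(0);
cudaMemcpyAsync(device, host, BLOCK, cudaMemcpyHostToDevice); // enqueue only
unsigned long long t_enqueue = dtime_usec(t0);                // typically microseconds
cudaDeviceSynchronize();                                      // wait for the device work
std::cout << "blocking: " << t_blocking << "us, enqueue: " << t_enqueue << "us" << std::endl;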
Managed-memory operations sometimes also require additional steps compared to similar non-managed activity; some of these are outlined here. I’m not sure any of that is applicable in this case, but there might be some examples.
I’ve measured again on an RTX 2080, and the UVM overhead is almost nothing (< 1%) for cudaMemcpyAsync & cudaMemsetAsync. Very reassuring.