The problem I am having is that Unified Memory access is significantly slower than non-unified (explicitly allocated) device memory access, by a factor of roughly 30x. The problem first showed up in my daily codebase but can easily be reproduced, for example with the ConjugateGradient sample included in the Toolkit. If in this sample (lines 145-151) I replace
checkCudaErrors(cudaMalloc((void **)&d_col, nz*sizeof(int)));
checkCudaErrors(cudaMalloc((void **)&d_row, (N+1)*sizeof(int)));
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_x, N*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_r, N*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_p, N*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_Ax, N*sizeof(float)));
with
checkCudaErrors(cudaMallocManaged((void **)&d_col, nz*sizeof(int)));
checkCudaErrors(cudaMallocManaged((void **)&d_row, (N+1)*sizeof(int)));
checkCudaErrors(cudaMallocManaged((void **)&d_val, nz*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_x, N*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_r, N*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_p, N*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_Ax, N*sizeof(float)));
and time the for loop that does the actual CG solving with N = 1048576 * 32, the original version takes around 80 ms while the unified-memory version takes around 2800 ms (a sketch of the timing harness I use is included after the hardware list below). The slowdown occurs on every kernel invocation, not just on the first loop iteration / first access to the memory, and it appears regardless of whether I use cuBLAS functions as in this sample or manually written kernels. The problem appears on the following tested driver versions (Windows):
397.55
397.64
397.93
398.36
With these driver versions the issue appears when using CUDA 9.1 (with and without update 1) and when using CUDA 9.2. It also appears on the latest Windows build and the latest Insider build. The issue also appears on the following tested GPUs:
Titan X (Maxwell)
GTX 1080 Ti
GTX 1060 (6 GB and 3 GB)
The issue also appears on independent systems with different CPUs (AMD and Intel).
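For reference, this is roughly how I time the solver loop mentioned above (a sketch, not the exact sample code; the event variables and the elided loop body are mine, checkCudaErrors is the helper from the sample):
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
checkCudaErrors(cudaEventRecord(start));
while (r1 > tol * tol && k <= max_iter)
{
    // unchanged CG iteration from the sample (cublasSaxpy, cublasSdot, SpMV, ...)
}
checkCudaErrors(cudaEventRecord(stop));
checkCudaErrors(cudaEventSynchronize(stop));
float ms = 0.0f;
checkCudaErrors(cudaEventElapsedTime(&ms, start, stop));
printf("CG solve loop: %f ms\n", ms);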
Further oddities that might point to a solution: when I try to profile the code, the Visual Profiler prints the following warning
==14984== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
again only on the newer driver versions. Regarding the multi-GPU warning: it appears regardless of whether one or two GPUs are physically installed in the system, and it also appears when setting the CUDA_VISIBLE_DEVICES environment variable manually and when setting CUDA_MANAGED_FORCE_DEVICE_ALLOC.
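To see what the driver actually reports for this configuration I also run a small check along these lines (a sketch; checkCudaErrors is again the helper from the sample, and reading these flags as the cause of the slowdown is my assumption, not something the warning states):
int devCount = 0;
checkCudaErrors(cudaGetDeviceCount(&devCount));
for (int i = 0; i < devCount; ++i)
{
    cudaDeviceProp prop;
    checkCudaErrors(cudaGetDeviceProperties(&prop, i));
    // managedMemory: device supports cudaMallocManaged at all
    // concurrentManagedAccess: managed memory can be accessed concurrently by CPU and GPU
    printf("Device %d (%s): managedMemory=%d concurrentManagedAccess=%d\n",
           i, prop.name, prop.managedMemory, prop.concurrentManagedAccess);
    for (int j = 0; j < devCount; ++j)
    {
        if (i == j) continue;
        int canAccessPeer = 0;
        checkCudaErrors(cudaDeviceCanAccessPeer(&canAccessPeer, i, j));
        printf("  peer access %d -> %d: %d\n", i, j, canAccessPeer);
    }
}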
Regardless of this warning, the profiler shows significant system memory usage (in the red, with transfer speeds of ~8 GB/s for reads and ~700 MB/s for writes for each invocation of the main kernel), which is odd because the program allocates 7 arrays of 100663294 elements of 4 bytes each, i.e. roughly 2.7 GB in total. The profiler reports 47838113 system-memory reads and 4194313 system-memory writes for a single kernel invocation.
I hope someone can point me to a possible solution other than staying on an older driver, which would also mean giving up CUDA 9.2. Because of that requirement the issue appears on all tested drivers with CUDA 9.2, since that CUDA version requires one of the drivers listed above. CUDA 9.1 does not show the issue on older drivers (e.g. 391.35).