The problem I am having is that Unified Memory access is significantly slower than non-unified (explicitly allocated) device memory access, by a factor of roughly 30x. The problem first showed up in my daily codebase but can easily be reproduced, for example with the ConjugateGradient sample included in the Toolkit. If in this sample (lines 145-151) I replace
checkCudaErrors(cudaMalloc((void **)&d_col, nz*sizeof(int)));
checkCudaErrors(cudaMalloc((void **)&d_row, (N+1)*sizeof(int)));
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_x, N*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_r, N*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_p, N*sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_Ax, N*sizeof(float)));
with
checkCudaErrors(cudaMallocManaged((void **)&d_col, nz*sizeof(int)));
checkCudaErrors(cudaMallocManaged((void **)&d_row, (N+1)*sizeof(int)));
checkCudaErrors(cudaMallocManaged((void **)&d_val, nz*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_x, N*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_r, N*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_p, N*sizeof(float)));
checkCudaErrors(cudaMallocManaged((void **)&d_Ax, N*sizeof(float)));
and time the for loop that does the actual CG solving with N = 1048576 * 32, the original version takes around 80 ms while the unified-memory version takes around 2800 ms (a sketch of the timing harness I use is included after the hardware list below). The slowdown occurs on every kernel invocation, not just on the first loop iteration / first access to the memory, and it appears regardless of whether I use cuBLAS functions as in this sample or manually written kernels. The problem appears on the following tested driver versions (Windows):
397.55
397.64
397.93
398.36
With these driver versions the issue appears when using CUDA 9.1 (with and without update 1) and when using CUDA 9.2. It also appears on the latest Windows build and the latest Insider build. The issue also appears on the following tested GPUs:
Titan X (Maxwell)
GTX 1080 Ti
GTX 1060 (6 GB and 3 GB)
The issue also appears on independent systems with different CPUs (AMD and Intel).
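For reference, this is roughly how I time the solver loop mentioned above (a sketch, not the exact sample code; the event variables and the elided loop body are mine, checkCudaErrors is the helper from the sample):
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
checkCudaErrors(cudaEventRecord(start));
while (r1 > tol * tol && k <= max_iter)
{
    // unchanged CG iteration from the sample (cublasSaxpy, cublasSdot, SpMV, ...)
}
checkCudaErrors(cudaEventRecord(stop));
checkCudaErrors(cudaEventSynchronize(stop));
float ms = 0.0f;
checkCudaErrors(cudaEventElapsedTime(&ms, start, stop));
printf("CG solve loop: %f ms\n", ms);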
Further oddities that might point to a solution: when I try to profile the code, the Visual Profiler prints the following warning
==14984== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
again only on the newer driver versions. Regarding the multi-GPU warning: it appears regardless of whether one or two GPUs are physically installed in the system, and it also appears when setting the CUDA_VISIBLE_DEVICES environment variable manually and when setting CUDA_MANAGED_FORCE_DEVICE_ALLOC.
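To see what the driver actually reports for this configuration I also run a small check along these lines (a sketch; checkCudaErrors is again the helper from the sample, and reading these flags as the cause of the slowdown is my assumption, not something the warning states):
int devCount = 0;
checkCudaErrors(cudaGetDeviceCount(&devCount));
for (int i = 0; i < devCount; ++i)
{
    cudaDeviceProp prop;
    checkCudaErrors(cudaGetDeviceProperties(&prop, i));
    // managedMemory: device supports cudaMallocManaged at all
    // concurrentManagedAccess: managed memory can be accessed concurrently by CPU and GPU
    printf("Device %d (%s): managedMemory=%d concurrentManagedAccess=%d\n",
           i, prop.name, prop.managedMemory, prop.concurrentManagedAccess);
    for (int j = 0; j < devCount; ++j)
    {
        if (i == j) continue;
        int canAccessPeer = 0;
        checkCudaErrors(cudaDeviceCanAccessPeer(&canAccessPeer, i, j));
        printf("  peer access %d -> %d: %d\n", i, j, canAccessPeer);
    }
}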
Regardless of this warning, the profiler shows significant system memory usage (in the red, with transfer speeds of ~8 GB/s for reads and ~700 MB/s for writes for each invocation of the main kernel), which is odd because the program allocates 7 arrays of 100663294 elements of 4 bytes each, i.e. roughly 2.7 GB in total. The profiler reports 47838113 system-memory reads and 4194313 system-memory writes for a single kernel invocation.
I hope someone can point me to a possible solution other than staying on an older driver, which would also mean giving up CUDA 9.2. Because of that requirement the issue appears on all tested drivers with CUDA 9.2, since that CUDA version requires one of the drivers listed above. CUDA 9.1 does not show the issue on older drivers (e.g. 391.35).