Unified Memory vs Pinned Host Memory vs GPU Global Memory

Vectorizer · August 29, 2014, 3:11pm

My GPU memory is far too small for a particular problem.
If I use Pinned Host Memory or Unified Memory will GPU threads be able to read/write directly from/to CPU memory or does the GPU global memory still come into play as a staging area?
Are there any limitations on the CPU RAM size; e.g. will the GPU be able access 64GB of CPU RAM?

Robert_Crovella · August 29, 2014, 3:27pm

Unified memory is a programming construct that allows the distinction between host and device memory to be somewhat obscured. But allocations in unified memory must still fit entirely within the available memory of the GPU, to be usable in device code. It will not allow you to “expand” the memory available to the GPU, beyond what is “on board”.

Pinned memory should allow you to expand the memory available to the GPU. There are many specific details which must be attended to. It will depend on what kind of GPU you have, as well as what operating system you are running, especially whether it is 32-bit or 64-bit. Furthermore, pinning memory “removes” it from the host demand-paged virtual memory system. I have successfully pinned 32GB in a 48GB host machine, but pinning 64GB will require significantly more than 64GB in your host machine (as well as other specifics).

Finally, pinned memory at best can achieve an access bandwidth approximately equal to your PCI Express bandwidth (discoverable using the bandwidthTest utility). Such bandwidth is generally quite a bit lower than what is available directly to the on-board GPU memory. So there is no free lunch.

EDIT: The first paragraph was correct when it was written in the CUDA 6 timeframe when unified memory was first introduced. With the advent of CUDA 8, oversubscription of GPU memory became possible in some settings. So in these settings, a managed allocation can exceed the size of the available physical “on board” GPU RAM.

Vectorizer · August 29, 2014, 4:23pm

Thanks for the response.
If I were to use Pinned Memory I cannot believe there would be a cache mechanism (right?), how about any relief from coalescing?

Robert_Crovella · August 29, 2014, 5:20pm

GPU L1 ~~and L2~~ caches should have the same behavior on pinned memory that is mapped as global memory, as any other type of global memory.

You would certainly want to use coalescing as much as possible, just as in any GPU code.

vsm2 · February 14, 2020, 6:05pm

Is this (pinned memory allocation in the host) still true in the case of Volta/Turing Architecture?

Because when I profile with nvprof/nsight compute, I do not see any L2 caching occurring for matrix A and B in the tiled matrix multiplication algorithm. If the data resides in the GPU memory (cudaMalloc) then I have 98% hit rate.

Robert_Crovella · February 15, 2020, 12:26am

The following application reads the same set of data over and over again. The loads are constructed to prevent the compiler from optimizing them out. We are reading 4GB of data, and doing so in less than 10ms. If the loads were happening across the PCIE bus this would take approximately 0.3s. Therefore I claim there is caching.

$ cat t1662.cu
#include <iostream>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


template <typename T>
__global__ void k(const T * __restrict__ d, T * __restrict__ s, size_t n){

  T sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += d[blockDim.x*blockIdx.x+((threadIdx.x+i)%blockDim.x)];
  s[blockDim.x*blockIdx.x+threadIdx.x] = sum;
}

typedef float my_t;
const int my_n = 1000;
const int ds = 1048567;
const int nTPB = 256;
int main(){

  my_t *d_d, *d_s;
#ifdef USE_HOST
  cudaHostAlloc(&d_d, ds*sizeof(d_d[0]), cudaHostAllocDefault);
  cudaHostAlloc(&d_s, ds*sizeof(d_d[0]), cudaHostAllocDefault);
#else
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  cudaMalloc(&d_s, ds*sizeof(d_s[0]));
#endif
  long long dt = dtime_usec(0);
  k<<<ds/nTPB, nTPB>>>(d_d, d_s, my_n);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "Duration: " << dt/(float)USECPSEC << "s" << std::endl;
  std::cout << "total bytes: " << sizeof(my_t)*my_n*ds << std::endl;
}


$ nvcc -o t1662 t1662.cu -DUSE_HOST
$ ./t1662
Duration: 0.003557s
total bytes: 4194268000
$ nvprof --metrics gld_transactions,gld_throughput ./t1662
==23709== NVPROF is profiling process 23709, command: ./t1662
Duration: 0.011846s
total bytes: 4194268000
==23709== Profiling application: ./t1662
==23709== Profiling result:
==23709== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: void k<float>(float const *, float*, unsigned long)
          1                          gld_transactions                  Global Load Transactions   159705000   159705000   159705000
          1                            gld_throughput                    Global Load Throughput  1378.3GB/s  1378.3GB/s  1378.3GB/s
$

Tesla V100-PCIE, CUDA 10.1.243, CentOS 7

vsm2 · February 15, 2020, 12:45am

Okay. Thank you. I clearly see L1 caching happening.

Robert_Crovella · February 15, 2020, 2:12am

A modification of my test suggests that L2 caching is not happening:

$ cat t1662.cu
#include <iostream>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
//#define MY_RESTRICT __restrict__
#define MY_RESTRICT

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


template <typename T>
__global__ void k(const T * MY_RESTRICT d, T * MY_RESTRICT s, size_t n){

  T sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += d[threadIdx.x + ((blockIdx.x+i)%gridDim.x)*blockDim.x];
  s[blockDim.x*blockIdx.x+threadIdx.x] = sum;
}

typedef float my_t;
const int my_n = 1000;
const int blks = 8*80;
const int nTPB = 256;
const int ds = blks*nTPB;
int main(){

  my_t *d_d, *d_s;
#ifdef USE_HOST
  cudaHostAlloc(&d_d, ds*sizeof(d_d[0]), cudaHostAllocDefault);
  cudaHostAlloc(&d_s, ds*sizeof(d_d[0]), cudaHostAllocDefault);
#else
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  cudaMalloc(&d_s, ds*sizeof(d_s[0]));
#endif
  long long dt = dtime_usec(0);
  k<<<ds/nTPB, nTPB>>>(d_d, d_s, my_n);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "Duration: " << dt/(float)USECPSEC << "s" << std::endl;
  std::cout << "total bytes: " << sizeof(my_t)*my_n*ds << std::endl;
}


$ nvcc -o t1662 t1662.cu -DUSE_HOST
$ nvprof --metrics gld_transactions,gld_throughput,sysmem_read_transactions ./t1662
==28266== NVPROF is profiling process 28266, command: ./t1662
Duration: 0.064642s
total bytes: 655360000
==28266== Profiling application: ./t1662
==28266== Profiling result:
==28266== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: void k<float>(float const *, float*, unsigned long)
          1                          gld_transactions                  Global Load Transactions    20480000    20480000    20480000
          1                            gld_throughput                    Global Load Throughput  11.489GB/s  11.489GB/s  11.489GB/s
          1                  sysmem_read_transactions           System Memory Read Transactions    20480000    20480000    20480000
$

I have modified my previous statement.

EtienneM · June 1, 2022, 2:52pm

Hi,
I do not understand why you do not memcpy the pinned host memory to gpu memory. I would expect the type of behavior under USE_HOST to work only when cudaHostAllocMapped is used.

Robert_Crovella · June 1, 2022, 3:12pm

because the purpose of this thread and discussion is to examine the behavior when host memory is mapped into the global space, not when the GPU is reading from device memory.

cudaHostAllocMapped is an implied/default flag for any pinned allocation in a UA regime. A UA regime is in effect for any instance of CUDA running in a 64-bit OS, for discrete CUDA GPUs.

For most practical development purposes, CUDA has only supported 64-bit OS for quite some time now.

Topic		Replies	Views
Unified memory oversubscription and page faults CUDA Programming and Performance	7	2808	March 23, 2018
Enabling Concurrent Managed Memory access on GPU CUDA Programming and Performance	10	86	February 12, 2025
Can I use Unified Memory in a soft real-time system? CUDA Programming and Performance	13	346	April 1, 2024
New to CUDA having memory transfer issues CUDA Programming and Performance	16	1989	April 18, 2017
Abysmal performance with Unified Memory and CUBLAS CUDA Programming and Performance	15	4292	November 29, 2014
A few general questions... CUDA Programming and Performance	2	3070	October 12, 2009
Maximizing Unified Memory Performance in CUDA Technical Blog	18	1221	May 14, 2019
Question about Pinned memory CUDA Programming and Performance	8	1891	June 16, 2016
Unified memory has slow bandwidth over NVLink 2.0 for large data sizes CUDA Programming and Performance	4	1077	September 25, 2019
Unified memory CUDA Programming and Performance	2	725	November 11, 2019

Unified Memory vs Pinned Host Memory vs GPU Global Memory

Related topics