Unified Memory vs Pinned Host Memory vs GPU Global Memory

My GPU memory is far too small for a particular problem.
If I use Pinned Host Memory or Unified Memory will GPU threads be able to read/write directly from/to CPU memory or does the GPU global memory still come into play as a staging area?
Are there any limitations on the CPU RAM size; e.g. will the GPU be able access 64GB of CPU RAM?

Unified memory is a programming construct that allows the distinction between host and device memory to be somewhat obscured. But allocations in unified memory must still fit entirely within the available memory of the GPU, to be usable in device code. It will not allow you to “expand” the memory available to the GPU, beyond what is “on board”.

Pinned memory should allow you to expand the memory available to the GPU. There are many specific details which must be attended to. It will depend on what kind of GPU you have, as well as what operating system you are running, especially whether it is 32-bit or 64-bit. Furthermore, pinning memory “removes” it from the host demand-paged virtual memory system. I have successfully pinned 32GB in a 48GB host machine, but pinning 64GB will require significantly more than 64GB in your host machine (as well as other specifics).

Finally, pinned memory at best can achieve an access bandwidth approximately equal to your PCI Express bandwidth (discoverable using the bandwidthTest utility). Such bandwidth is generally quite a bit lower than what is available directly to the on-board GPU memory. So there is no free lunch.

EDIT: The first paragraph was correct when it was written in the CUDA 6 timeframe when unified memory was first introduced. With the advent of CUDA 8, oversubscription of GPU memory became possible in some settings. So in these settings, a managed allocation can exceed the size of the available physical “on board” GPU RAM.

Thanks for the response.
If I were to use Pinned Memory I cannot believe there would be a cache mechanism (right?), how about any relief from coalescing?

GPU L1 and L2 caches should have the same behavior on pinned memory that is mapped as global memory, as any other type of global memory.

You would certainly want to use coalescing as much as possible, just as in any GPU code.

Is this (pinned memory allocation in the host) still true in the case of Volta/Turing Architecture?

Because when I profile with nvprof/nsight compute, I do not see any L2 caching occurring for matrix A and B in the tiled matrix multiplication algorithm. If the data resides in the GPU memory (cudaMalloc) then I have 98% hit rate.

The following application reads the same set of data over and over again. The loads are constructed to prevent the compiler from optimizing them out. We are reading 4GB of data, and doing so in less than 10ms. If the loads were happening across the PCIE bus this would take approximately 0.3s. Therefore I claim there is caching.

$ cat t1662.cu
#include <iostream>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


template <typename T>
__global__ void k(const T * __restrict__ d, T * __restrict__ s, size_t n){

  T sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += d[blockDim.x*blockIdx.x+((threadIdx.x+i)%blockDim.x)];
  s[blockDim.x*blockIdx.x+threadIdx.x] = sum;
}

typedef float my_t;
const int my_n = 1000;
const int ds = 1048567;
const int nTPB = 256;
int main(){

  my_t *d_d, *d_s;
#ifdef USE_HOST
  cudaHostAlloc(&d_d, ds*sizeof(d_d[0]), cudaHostAllocDefault);
  cudaHostAlloc(&d_s, ds*sizeof(d_d[0]), cudaHostAllocDefault);
#else
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  cudaMalloc(&d_s, ds*sizeof(d_s[0]));
#endif
  long long dt = dtime_usec(0);
  k<<<ds/nTPB, nTPB>>>(d_d, d_s, my_n);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "Duration: " << dt/(float)USECPSEC << "s" << std::endl;
  std::cout << "total bytes: " << sizeof(my_t)*my_n*ds << std::endl;
}


$ nvcc -o t1662 t1662.cu -DUSE_HOST
$ ./t1662
Duration: 0.003557s
total bytes: 4194268000
$ nvprof --metrics gld_transactions,gld_throughput ./t1662
==23709== NVPROF is profiling process 23709, command: ./t1662
Duration: 0.011846s
total bytes: 4194268000
==23709== Profiling application: ./t1662
==23709== Profiling result:
==23709== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: void k<float>(float const *, float*, unsigned long)
          1                          gld_transactions                  Global Load Transactions   159705000   159705000   159705000
          1                            gld_throughput                    Global Load Throughput  1378.3GB/s  1378.3GB/s  1378.3GB/s
$

Tesla V100-PCIE, CUDA 10.1.243, CentOS 7

Okay. Thank you. I clearly see L1 caching happening.

A modification of my test suggests that L2 caching is not happening:

$ cat t1662.cu
#include <iostream>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
//#define MY_RESTRICT __restrict__
#define MY_RESTRICT

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


template <typename T>
__global__ void k(const T * MY_RESTRICT d, T * MY_RESTRICT s, size_t n){

  T sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += d[threadIdx.x + ((blockIdx.x+i)%gridDim.x)*blockDim.x];
  s[blockDim.x*blockIdx.x+threadIdx.x] = sum;
}

typedef float my_t;
const int my_n = 1000;
const int blks = 8*80;
const int nTPB = 256;
const int ds = blks*nTPB;
int main(){

  my_t *d_d, *d_s;
#ifdef USE_HOST
  cudaHostAlloc(&d_d, ds*sizeof(d_d[0]), cudaHostAllocDefault);
  cudaHostAlloc(&d_s, ds*sizeof(d_d[0]), cudaHostAllocDefault);
#else
  cudaMalloc(&d_d, ds*sizeof(d_d[0]));
  cudaMalloc(&d_s, ds*sizeof(d_s[0]));
#endif
  long long dt = dtime_usec(0);
  k<<<ds/nTPB, nTPB>>>(d_d, d_s, my_n);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "Duration: " << dt/(float)USECPSEC << "s" << std::endl;
  std::cout << "total bytes: " << sizeof(my_t)*my_n*ds << std::endl;
}


$ nvcc -o t1662 t1662.cu -DUSE_HOST
$ nvprof --metrics gld_transactions,gld_throughput,sysmem_read_transactions ./t1662
==28266== NVPROF is profiling process 28266, command: ./t1662
Duration: 0.064642s
total bytes: 655360000
==28266== Profiling application: ./t1662
==28266== Profiling result:
==28266== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: void k<float>(float const *, float*, unsigned long)
          1                          gld_transactions                  Global Load Transactions    20480000    20480000    20480000
          1                            gld_throughput                    Global Load Throughput  11.489GB/s  11.489GB/s  11.489GB/s
          1                  sysmem_read_transactions           System Memory Read Transactions    20480000    20480000    20480000
$

I have modified my previous statement.