cuda managed storage slow to read on host

Reading a 1M-element managed array on the host in nonsequential order is over twice as slow as reading a nonmanaged array. Reading it sequentially shows no such problem, which suggests the page management is thrashing. Yet 1M elements is only 1/64000 of the real memory on the host. What’s going on, and can I fix it? Details:

The host is a dual 14-core Intel Xeon with 256GB of memory. The device is an Nvidia GeForce GTX 1080 Pascal. CUDA is 8.0 and the compute capability is 6.1. The problem is to multiply two 1000x1000 matrices. The input matrices are either managed or nonmanaged. The result is always nonmanaged. Note that initializing the managed matrices is a little slower, but multiplying them is much slower.

Note that there is no kernel at all. There is no reason ever to copy the data to the GPU. If the arrays were created on the GPU, then they should be copied to the host the first time they’re read, and they should stay on the host. What am I missing? Thanks.

I created a separate example processing a 1M element array sequentially, and whether the array was managed did not affect the time. That’s why I think this is a memory thrashing problem.

// test managed memory speed.

#include <time.h>
#include <iostream>

using std::cout;
using std::endl;
using std::string;

#define TIME(arg) ( (arg), Delta_Clock_Time(#arg) )

const int n=1000;
__device__ __managed__  float am[n][n], bm[n][n];
float a[n][n], b[n][n], c[n][n];

void Delta_Clock_Time(const string &msg)  {
  timespec tp;
  double   ClockTotTime;
  static double Clockbasetime;
  static double old_time = 0.0;
  double  delta;
  static bool first = true;
  clock_gettime(CLOCK_MONOTONIC_RAW, &tp);
  ClockTotTime = tp.tv_sec + tp.tv_nsec*1e-9;
  if (first) {
    Clockbasetime = ClockTotTime;
    first = false;
  }
  ClockTotTime -= Clockbasetime;
  delta = ClockTotTime - old_time;
  old_time = ClockTotTime;
  cout << "Elapsed time for " << msg << ": " << delta << endl;
}

Output:

Running testmanaged3 with n= 1000
Elapsed time for {}: 0
Elapsed time for init(): 0.0380327
Elapsed time for init_managed(): 0.467357
Elapsed time for init(): 0.0143631
Elapsed time for init_managed(): 0.0152109
Elapsed time for seqmul(): 0.994104
Elapsed time for seqmul_managed(): 2.89084
Elapsed time for seqmul_managed(): 2.89027
7.53s real  6.89s user  0.64s system  99% 0,0 socket  112 mem ./testmanaged3

Note that I ran the functions twice, as people advise. Rerunning the program can change the multiplication times 10%, but reading the managed arrays is always over twice as slow.

Output from nvprof:

% nvprof testmanaged3   
Running testmanaged3 with n= 1000
Elapsed time for {}: 0
Elapsed time for init(): 0.0380293
==43092== NVPROF is profiling process 43092, command: testmanaged3
Elapsed time for init_managed(): 0.663673
Elapsed time for init(): 0.0280011
Elapsed time for init_managed(): 0.0258395
Elapsed time for seqmul(): 1.25066
Elapsed time for seqmul_managed(): 2.89211
Elapsed time for seqmul_managed(): 2.89145
==43092== Profiling application: testmanaged3
==43092== Profiling result:
No kernels were profiled.

==43092== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  48  162.83KB  4.0000KB  0.9961MB  7.632813MB  656.0640us  Device To Host
  17         -         -         -           -  1.409985ms  GPU Page fault groups
Total CPU Page faults: 24

==43092== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.73%  456.70ms         1  456.70ms  456.70ms  456.70ms  cuDevicePrimaryCtxRetain
  0.14%  621.71us        91  6.8320us     324ns  289.81us  cuDeviceGetAttribute
  0.12%  529.64us         1  529.64us  529.64us  529.64us  cuDeviceTotalMem
  0.01%  41.754us         1  41.754us  41.754us  41.754us  cuDeviceGetName
  0.01%  33.781us        32  1.0550us     419ns  2.9270us  cuModuleGetFunction
  0.00%  19.496us        13  1.4990us     511ns  8.7430us  cuModuleGetGlobal
  0.00%  4.4190us         3  1.4730us     410ns  3.4780us  cuDeviceGetCount
  0.00%  2.4370us         1  2.4370us  2.4370us  2.4370us  cuCtxSetCurrent
  0.00%  2.0400us         3     680ns     349ns  1.0310us  cuDeviceGet
  0.00%  1.3010us         2     650ns     536ns     765ns  cuCtxGetCurrent
  0.00%     417ns         1     417ns     417ns     417ns  cuCtxGetDevice
8.14s real  7.27s user  0.74s system  98% 0,0 socket  152 mem nvprof testmanaged3

The OS is

Linux host 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:55:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I compiled the program thus:

nvcc -arch=compute_61 -O1 -I/local/cuda/include testmanaged3.cu -o testmanaged3

I posted this question on stackoverflow on Feb 20, but didn’t get any specific info, just some guesses.

Thanks.

Reading in nonsequential order also prevents the OS readahead from hiding the latency of the page transfers over PCIe. That is, if you read sequentially, additional pages will already have been transferred while you are still reading earlier ones. With random access you have to wait for the page to be transferred every time you touch a new page.

I’d consider this a more likely explanation than thrashing.

Very sorry, but my code above was truncated, deleting all the interesting stuff. Here’s my MWE again:

// test managed memory speed.

#include <time.h>
#include <iostream>

using std::cout;
using std::endl;
using std::string;

#define TIME(arg) ( (arg), Delta_Clock_Time(#arg) )

const int n=1000;
__device__ __managed__  float am[n][n], bm[n][n];
float a[n][n], b[n][n], c[n][n];

void Delta_Clock_Time(const string &msg)  {
  timespec tp;
  double   ClockTotTime;
  static double Clockbasetime;
  static double old_time = 0.0;
  double  delta;
  static bool first = true;
  clock_gettime(CLOCK_MONOTONIC_RAW, &tp);
  ClockTotTime = tp.tv_sec + tp.tv_nsec*1e-9;
  if (first) {
    Clockbasetime = ClockTotTime;
    first = false;
  }
  ClockTotTime -= Clockbasetime;
  delta = ClockTotTime - old_time;
  old_time = ClockTotTime;
  cout << "Elapsed time for " << msg << ": " << delta << endl;
}

void init() {
  srand48(5L);
  for (int i=0; i<n; i++)
    for (int j=0; j<n; j++) {
      a[i][j] = drand48();
      b[i][j] = drand48();
    }
}

void init_managed() {
  srand48(5L);
  for (int i=0; i<n; i++)
    for (int j=0; j<n; j++) {
      am[i][j] = drand48();
      bm[i][j] = drand48();
    }
}

void seqmul() {
  for (int i=0; i<n; i++)
    for (int j=0; j<n; j++) {
      c[i][j] = 0;
      for (int k=0; k<n; k++)
        c[i][j] += a[i][k]*b[k][j];
    }
}


void seqmul_managed() {
  for (int i=0; i<n; i++)
    for (int j=0; j<n; j++) {
      c[i][j] = 0;
      for (int k=0; k<n; k++)
        c[i][j] += am[i][k]*bm[k][j];
    }
}

int main(void) {
  cout << "Running testmanaged3 with n= " << n << endl;
  TIME({});
  TIME(init());
  TIME(init_managed());
  TIME(init());
  TIME(init_managed());
  TIME(seqmul());
  TIME(seqmul_managed());
  TIME(seqmul_managed());
}

I never call any CUDA function explicitly, so what errors would I check for?

I suppose that device managed variables are initially allocated on the device. However, copying them to the host should take very little time. Indeed, that is what the profiling output shows.

Also, the total amount of space my variables use is an infinitesimal fraction of the available storage on either the host or the device.

Absent a more detailed description of how managed storage is implemented, it appears I’ve discovered that it uses a very small undocumented cache. Please tell me I’m wrong. Where would I learn about other performance-killing features?
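If it really is demand paging rather than a cache, maybe the unified-memory hints new in CUDA 8 would help: advise the driver that the matrices belong on the host, and migrate them there in one shot instead of via on-demand page faults. I haven’t tried this; an untried sketch (this would replace the implicit migration in my MWE):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

const int n = 1000;
__device__ __managed__ float am[n][n], bm[n][n];

int main() {
  // Hint that these matrices should be resident on the host.
  cudaMemAdvise(am, sizeof(am), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
  cudaMemAdvise(bm, sizeof(bm), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
  // Migrate both arrays to the host in bulk, not one page fault at a time.
  cudaMemPrefetchAsync(am, sizeof(am), cudaCpuDeviceId, 0);
  cudaMemPrefetchAsync(bm, sizeof(bm), cudaCpuDeviceId, 0);
  cudaDeviceSynchronize();  // wait for the migrations before timing anything
  printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
  // ... now initialize and run the host-side multiply as before ...
  return 0;
}
```

If the multiply is still slow after an explicit prefetch to the host, then the readahead explanation is probably wrong and something else is going on.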

You might be wondering what motivated me to try this.

First, even with this apparent implementation problem, managed variables save me a lot of time.

Second, this example is still 15x faster than single-threaded host code.

Third, this suggests that having all malloc’ed memory automatically managed will have similar problems.

BTW, the 4.10 Linux kernel is now available. Has anyone gotten it working with managed variables? I’ve tried and failed on two machines. It appears that 4.10 changed the API for some kernel function that ZFS uses, so the ZFS drivers won’t compile.