Poor performance of cudaHostUnregister

I’ve been working with the cudaHostRegister and cudaHostUnregister functions and I’ve noticed that the latter takes a very long time. It is slow even compared to a cudaMemcpy of the same data, and that holds even when the memcpy uses plain pageable (not page-locked) memory.

I’ve made the following short program:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static struct timespec tp;
static clockid_t clk = CLOCK_REALTIME;

static void tu_timer_start(void)
{
  int res = clock_gettime(clk, &tp);
  assert(!res);
}

static long long tu_timer_stop(void)
{
  struct timespec tp_new;
  long long elapsed;
  int res = clock_gettime(clk, &tp_new);
  assert(!res);
  elapsed = 1000000000LL * (tp_new.tv_sec - tp.tv_sec) + tp_new.tv_nsec - tp.tv_nsec;
  tp = tp_new;
  return elapsed;
}

int main() {
  const int length = 999424;
  const int pagesize = 4096;

  // Allocate page-aligned host data and fill it with zeroes.
  int *paged, *locked;
  posix_memalign((void**) &paged, pagesize, length * sizeof(int));
  posix_memalign((void**) &locked, pagesize, length * sizeof(int));
  memset(paged, 0, length * sizeof(int));
  memset(locked, 0, length * sizeof(int));

  // Allocate device data.
  int *devPaged, *devLocked;
  tu_timer_start();
  printf("%20d\n", cudaMalloc(&devPaged, length * sizeof(int)));
  printf("%20d\n", cudaMalloc(&devLocked, length * sizeof(int)));
  printf("Initialization:   %12lld ns\n", tu_timer_stop());

  // Measure copy time with pageable data.
  tu_timer_start();
  printf("%20d\n", cudaMemcpy(devPaged, paged, length * sizeof(int), cudaMemcpyHostToDevice));
  printf("Copy pageable:    %12lld ns\n", tu_timer_stop());

  // Measure time to page-lock host data.
  tu_timer_start();
  printf("%20d\n", cudaHostRegister(locked, length * sizeof(int), 0));
  printf("Host register:    %12lld ns\n", tu_timer_stop());

  // Measure copy time with page-locked data.
  tu_timer_start();
  printf("%20d\n", cudaMemcpy(devLocked, locked, length * sizeof(int), cudaMemcpyHostToDevice));
  printf("Copy page-locked: %12lld ns\n", tu_timer_stop());

  // Measure time to release the page-lock on host data.
  tu_timer_start();
  cudaHostUnregister(locked);
  printf("Host unregister:  %12lld ns\n", tu_timer_stop());

  return 0;
}

This gives the following output on a quad-core Intel i5 760 (2.80 GHz per core) with a Tesla C2050 (with the CUDA return codes omitted here):

Initialization:       81027005 ns
Copy pageable:         1263236 ns
Host register:          436132 ns
Copy page-locked:       706051 ns
Host unregister:       2139736 ns

This shows my problem. In my actual program it is even worse: I often measure cudaHostUnregister taking around 3460000 ns. This would suggest that, on top of being slow, it doesn’t play well with concurrent asynchronous memcopies or kernel runs.

Why does this function take so long, and is there a way to speed it up? And does it really not work in parallel with memcopies and kernels, and if so, why not?

Or is there simply a much better way of parallelizing memcopies and kernel runs?

You need to amortize the register/unregister calls over multiple memcpys.
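Something like this sketch, where you pay the pin/unpin cost once and spread it over many copies (fill_input and the batch count are made-up placeholders, not anything from your program):

#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for whatever produces each batch of input.
static void fill_input(int *buf, int length, int batch)
{
  memset(buf, batch & 0xff, length * sizeof(int));
}

int main(void)
{
  const int length = 999424;
  const int batches = 100;                  // arbitrary number of transfers
  const size_t bytes = length * sizeof(int);

  int *host;
  posix_memalign((void**) &host, 4096, bytes);

  int *dev;
  cudaMalloc(&dev, bytes);

  cudaHostRegister(host, bytes, 0);         // pay the pinning cost once

  for (int i = 0; i < batches; ++i) {
    fill_input(host, length, i);            // produce the next batch in place
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // fast pinned copy
    // ... launch kernels on dev here ...
  }

  cudaHostUnregister(host);                 // pay the unpinning cost once

  cudaFree(dev);
  free(host);
  return 0;
}

This way the register/unregister overhead is divided across all the batches instead of being paid on every transfer.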

I presume if the register/memcpy/unregister sequence were faster than just a single memcpy of unlocked memory, the CUDA libraries would use this method themselves.

Yeah, we know we have some perf bug related to cudaHostUnregister. Working on that for a future release. It’s never going to be unbelievably fast though (it’s a kernel call), so you still need to amortize it.

Also, we probably wouldn’t pin memory ourselves because that could surprise a user making a very large copy…

Thanks for your answers. I don’t quite get how you would amortize it over multiple memcopies, though, as when I multiply the size by ten I get the following results:

Initialization:      140624485 ns
Copy pageable:         9020523 ns
Host register:         1508015 ns
Copy page-locked:      6528010 ns
Host unregister:      18603114 ns

Here the unregister call takes a little under ten times as long as well, so there is almost no gain from reduced overhead.

I think I’ll simply make a staging area in pinned memory and first copy my data there before sending it off to the GPU. That’s over twice as fast and can be done without the GPU needing to synchronize. Maybe that is what you meant, but if not, could you explain?
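In code, roughly this (untested sketch; send_to_gpu and the parameter names are just placeholders):

#include <string.h>
#include <cuda_runtime.h>

// Staging-buffer sketch: 'staging' is a pinned buffer allocated once with
// cudaHostAlloc((void**) &staging, bytes, cudaHostAllocDefault) and freed
// with cudaFreeHost(staging) at shutdown.
static void send_to_gpu(int *dev, const int *pageable, size_t bytes,
                        int *staging, cudaStream_t stream)
{
  // Host-to-host copy into the pinned staging buffer.
  memcpy(staging, pageable, bytes);
  // Asynchronous DMA out of pinned memory; returns immediately, so the
  // host doesn't have to wait for the GPU here.
  cudaMemcpyAsync(dev, staging, bytes, cudaMemcpyHostToDevice, stream);
}

The one catch is that the staging buffer must not be overwritten before the previous cudaMemcpyAsync from it has completed, so some stream or event synchronization is needed before reusing it.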

I meant to reuse the same memory, not to increase the size of the mapped block. So yes, if your algorithms don’t allow this, you would need to use the mapped memory as a buffer through which you stage the data (which apparently is also what cudaMemcpy does).

If Nvidia could speed up mapping/unmapping to the point where mapping on demand is faster (for reasonable amounts of pinned memory) than staging through a buffer, that would certainly be a welcome improvement though. In cases where this would pin too much memory at once, the transfer could then be broken up into multiple smaller ones and still gain a speed advantage.
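For the broken-up case, a double-buffered pipeline along these lines is already possible today with cudaHostAlloc’d staging buffers (upload_chunked and CHUNK are invented for illustration):

#include <string.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)  // 1 MiB per chunk; a tuning parameter

// Double-buffered upload sketch: the CPU fills one pinned chunk while the
// DMA engine drains the other, so at most 2 * CHUNK bytes are pinned.
static void upload_chunked(char *dev, const char *src, size_t bytes,
                           cudaStream_t stream)
{
  char *pin[2];
  cudaEvent_t done[2];
  for (int b = 0; b < 2; ++b) {
    cudaHostAlloc((void**) &pin[b], CHUNK, cudaHostAllocDefault);
    cudaEventCreate(&done[b]);
  }

  size_t off = 0;
  for (int i = 0; off < bytes; off += CHUNK, ++i) {
    int b = i & 1;
    size_t n = (bytes - off < CHUNK) ? bytes - off : CHUNK;
    cudaEventSynchronize(done[b]);  // wait until this buffer's previous copy drained
    memcpy(pin[b], src + off, n);   // overlaps with the other buffer's in-flight DMA
    cudaMemcpyAsync(dev + off, pin[b], n, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(done[b], stream);
  }

  cudaStreamSynchronize(stream);
  for (int b = 0; b < 2; ++b) {
    cudaEventDestroy(done[b]);
    cudaFreeHost(pin[b]);
  }
}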

Maybe it would be possible to make an unsafe version of cudaHostUnregister that’s a lot faster but can only be used when certain conditions are met (conditions the programmer would have to check him/herself).

We’ll fix this eventually, although it’s not going to make the next release.