cudaHostRegister and interior pointers

If a block of memory is registered via cudaHostRegister, will it speed up cudaMemcpy operations to any part of the block, or only copies where the address passed to cudaMemcpy* is the same address as passed to cudaHostRegister?

That is:

char *ptr = (char *)malloc(4096);  // char * so that pointer arithmetic is valid
char *dev; cudaMalloc((void **)&dev, 4096);

cudaHostRegister(ptr, 4096, cudaHostRegisterDefault);

cudaMemcpy(ptr, dev, 2048, cudaMemcpyDeviceToHost); // accelerated
cudaMemcpy(ptr + 2048, dev + 2048, 2048, cudaMemcpyDeviceToHost); // accelerated ???

  1. Will the second cudaMemcpy call recognize the memory is registered?
  2. Is there any (substantive) penalty to using the interior pointer?


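Registration applies to the whole range passed to cudaHostRegister, so it affects copies to or from any address inside the block, not just copies that start at the registered base address. The second cudaMemcpy call will recognize the memory as registered/pinned.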