problem with zero-copy write to write-combined memory.


I’m trying to zero-copy write the result of my kernel to host memory, which is allocated with cudaHostAlloc. With the cudaHostAllocDefault flags everything works but with the cudaHostAllocWriteCombined flag it breaks down, unless I use cudaHostGetDevicePointer.

So basically the question boils down to why do I need cudaHostGetDevicePointer and a separate device pointer when using write-combined memory. I’m using the 6.0 toolkit and a 2.0 device.

If necessary I can post some sample code.

To illustrate the issue, here is a little program showing the problem. It might be that I’m overlooking some addressing mode or something like that.

The output of this program:

__global__ void kernel(int *value) {
  printf("kernel value = %p\n", value);
  *value = 2;

int main() {

  int *value = NULL, *devValue = NULL;
  cudaDeviceProp prop;

  cudaGetDeviceProperties(&prop, 0); 

  printf("prof.unifiedAddressing = %i\n", prop.unifiedAddressing);

  unsigned int flags = cudaHostAllocDefault;

  cudaHostAlloc((void **)&value, sizeof(int), flags);
  cudaHostGetDevicePointer(&devValue, value, 0); 

  printf("value = %p, devValue = %p\n", value, devValue);

  *value = 1;

  printf("value = %i\n", *value);

  kernel<<<1, 1>>>(value);
  printf("value = %i\n", *value);



is as expected, with cuda-memcheck not giving any errors:

prof.unifiedAddressing = 1
value = 0x200100000, devValue = 0x200100000
value = 1
kernel value = 0x200100000
value = 2
========= ERROR SUMMARY: 0 errors

Changing the flag on line 17 to cudaHostAllocWriteCombined, changes the value and devValue pointers to

value = 0x2b7378b1f000, devValue = 0x200100000

Which then obviously creates errors when the kernel is being executed:

========= Invalid __global__ write of size 4
=========     at 0x00000148 in /direct/nanofun/mdereus/projects/gpuMatrixMem/*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x2b7378b1f000 is out of bounds

I don’t understand why in the first situation the pointers are equal and in the second they are different.

Apparently it is simply necessary to use cudaHostGetDevicePointer, since write-combined memory has different host and device pointers. See the end of the Unified Addressing paragraph of the CUDA Runtime API documantation, so this solves the issue.