cudaHostAllocWriteCombined vs cudaHostAllocMapped

Does the term zero-copy refer to write-combined memory or host-mapped memory? I’m reading Chapter 11 of “CUDA by example” and I have a few questions…

Is it OK to do something like the following for variable host_p that is declared as cudaHostAllocMapped but not cudaHostAllocWriteCombined:

int main(void) {

 int * host_p; /*< Host data allocated as pinned memory */
 int * dev_ptr_p; /*< this pointer resides on the host */
 int ns = 32;
 int data_size = ns * sizeof(int);

 /* Allocate host_p as cudaHostAllocMapped */
   cudaHostAlloc((void**) &host_p, data_size, cudaHostAllocMapped));

 for (int i = 0; i < ns; i++)
  host_p[i] = i + 1;

 /* Get the device-side pointer of host_p and pass it to the kernel */
 checkCudaErrors(cudaHostGetDevicePointer(&dev_ptr_p, host_p, 0));
 kernel<<<1, ns>>>(dev_ptr_p);

 for (int i = 0; i < ns; i++)
  printf("host_p[%d] = %d\n", i, host_p[i]);

 return 0;

and one more question in regard to how mapped memory functions: If I only do

cudaHostAlloc((void**) &host_p, data_size, cudaHostAllocMapped)

and I don’t all a kernel or do anything on the device, then no data will be transferred to the device, right?

Data, I assume, are transferred only when I invoke the kernel function. Does the same hold for device-to-host transactions? For instance, in the code above, if we omit lines 20-21 (we don’t access host_p any more from the host), will the device update the value of host_p on the host?

It seems that there are some implicit data transactions from the host to the device and vice-versa and it’s not very clear how the whole thing works.

zero copy means mapped

Yes, if you only do the cudaHostAlloc operation, no data is transferred. Data is only transferred from the host to the device when the kernel reads the host mapped data. If the kernel writes to the host-mapped data, then (eventually) a write transaction is triggered to host memory, regardless of what the CPU thread does.

Thanks a lot for the answer!

But if a variable is allocated with cudaHostAlloc and the flag cudaHostAllocWriteCombined, its address is passed to a kernel (that only reads from this variable) and, more, the host doesn’t read that variable any more, then are any data transferred from the device to the host? For instance, consider this example:

__global__ void kernel(const int *da, int *db) {
	int tid =threadIdx.x + blockIdx.x * blockDim.x;
	db[tid] = da[tid] + 10;

int main(void) {

	int * p;
	int * q;
	int * dev_p;
	int * dev_q;
	int ns = 64;
	int size = ns * sizeof(int);

	/* Allocate p as zero-copy write-combined memory */
           cudaHostAlloc(&p, size,
	   cudaHostAllocWriteCombined | cudaHostAllocMapped));

	/* Allocate q as zero-copy memory (not write-combined) */
	   cudaHostAlloc(&q, size, cudaHostAllocMapped));

	/* Initialize p */
	for (int i = 0; i < ns; i++){ p[i] = i + 1; }

	/* Get the device pointers for p and q */
	checkCudaErrors( cudaHostGetDevicePointer(&dev_p, p, 0));
	checkCudaErrors( cudaHostGetDevicePointer(&dev_q, q, 0));

	kernel<<<ns/2, ns>>>(dev_p, dev_q); // Launch the kernel

	for (int i = 0; i < ns; i++){ printf("q[%d] = %d\n", i, q[i]); }

	return 0;

Here the values of p don’t change device-side; will p be transferred back to the host? (It wouldn’t make any sense of course!)

At the completion of the kernel (at least), the data written to by the kernel in the mapped space will be transferred back to the CPU memory. This is true regardless of the flags passed to cudaHostAlloc.

If no data in the mapped space is written to by the kernel, then no data will be written back to CPU memory.

I feel like I am repeating myself.

OK, got it! Thanks a lot txbob!

As I understood - when using zero-copy memory, the device reads/writes directly to the host memory through the PCIe bus. There is no device instance of the allocated memory area and there are no explicit memory transfers like in case of the page-able and pinned memory model.

This explains why “data transfers” are triggered by read/write operations.