cudaHostAllocWriteCombined vs cudaHostAllocMapped

chung · October 12, 2014, 5:00pm

Does the term zero-copy refer to write-combined memory or host-mapped memory? I’m reading Chapter 11 of “CUDA by example” and I have a few questions…

Is it OK to do something like the following for variable host_p that is declared as cudaHostAllocMapped but not cudaHostAllocWriteCombined:

int main(void) {

 int * host_p; /*< Host data allocated as pinned memory */
 int * dev_ptr_p; /*< this pointer resides on the host */
 int ns = 32;
 int data_size = ns * sizeof(int);

 /* Allocate host_p as cudaHostAllocMapped */
 checkCudaErrors(
   cudaHostAlloc((void**) &host_p, data_size, cudaHostAllocMapped));

 for (int i = 0; i < ns; i++)
  host_p[i] = i + 1;

 /* Get the device-side pointer of host_p and pass it to the kernel */
 checkCudaErrors(cudaHostGetDevicePointer(&dev_ptr_p, host_p, 0));
 kernel<<<1, ns>>>(dev_ptr_p);
 checkCudaErrors(cudaDeviceSynchronize());

 for (int i = 0; i < ns; i++)
  printf("host_p[%d] = %d\n", i, host_p[i]);

 checkCudaErrors(cudaFreeHost(host_p));
 
 return 0;
}

and one more question in regard to how mapped memory functions: If I only do

cudaHostAlloc((void**) &host_p, data_size, cudaHostAllocMapped)

and I don’t all a kernel or do anything on the device, then no data will be transferred to the device, right?

Data, I assume, are transferred only when I invoke the kernel function. Does the same hold for device-to-host transactions? For instance, in the code above, if we omit lines 20-21 (we don’t access host_p any more from the host), will the device update the value of host_p on the host?

It seems that there are some implicit data transactions from the host to the device and vice-versa and it’s not very clear how the whole thing works.

Robert_Crovella · October 12, 2014, 5:32pm

zero copy means mapped

Yes, if you only do the cudaHostAlloc operation, no data is transferred. Data is only transferred from the host to the device when the kernel reads the host mapped data. If the kernel writes to the host-mapped data, then (eventually) a write transaction is triggered to host memory, regardless of what the CPU thread does.

chung · October 13, 2014, 1:20am

Thanks a lot for the answer!

But if a variable is allocated with cudaHostAlloc and the flag cudaHostAllocWriteCombined, its address is passed to a kernel (that only reads from this variable) and, more, the host doesn’t read that variable any more, then are any data transferred from the device to the host? For instance, consider this example:

__global__ void kernel(const int *da, int *db) {
	int tid =threadIdx.x + blockIdx.x * blockDim.x;
	db[tid] = da[tid] + 10;
}

int main(void) {

	int * p;
	int * q;
	int * dev_p;
	int * dev_q;
	int ns = 64;
	int size = ns * sizeof(int);

	/* Allocate p as zero-copy write-combined memory */
	checkCudaErrors(
           cudaHostAlloc(&p, size,
	   cudaHostAllocWriteCombined | cudaHostAllocMapped));

	/* Allocate q as zero-copy memory (not write-combined) */
	checkCudaErrors(
	   cudaHostAlloc(&q, size, cudaHostAllocMapped));

	/* Initialize p */
	for (int i = 0; i < ns; i++){ p[i] = i + 1; }

	/* Get the device pointers for p and q */
	checkCudaErrors( cudaHostGetDevicePointer(&dev_p, p, 0));
	checkCudaErrors( cudaHostGetDevicePointer(&dev_q, q, 0));

	kernel<<<ns/2, ns>>>(dev_p, dev_q); // Launch the kernel
	checkCudaErrors(cudaDeviceSynchronize());

	for (int i = 0; i < ns; i++){ printf("q[%d] = %d\n", i, q[i]); }

	return 0;
}

Here the values of p don’t change device-side; will p be transferred back to the host? (It wouldn’t make any sense of course!)

Robert_Crovella · October 13, 2014, 3:06pm

At the completion of the kernel (at least), the data written to by the kernel in the mapped space will be transferred back to the CPU memory. This is true regardless of the flags passed to cudaHostAlloc.

If no data in the mapped space is written to by the kernel, then no data will be written back to CPU memory.

I feel like I am repeating myself.

chung · October 13, 2014, 6:18pm

OK, got it! Thanks a lot txbob!

adam.zborovszky · August 19, 2019, 2:08pm

As I understood - when using zero-copy memory, the device reads/writes directly to the host memory through the PCIe bus. There is no device instance of the allocated memory area and there are no explicit memory transfers like in case of the page-able and pinned memory model.

This explains why “data transfers” are triggered by read/write operations.

Topic		Replies	Views
problem with zero-copy write to write-combined memory. CUDA Programming and Performance	2	727	September 20, 2014
Synchronizing mapped memory use between host and device while kernel is running CUDA Programming and Performance	16	13209	June 4, 2010
Does pinned memory can accessed by Device? CUDA Programming and Performance	4	1294	March 18, 2024
Difference between cudaMallocManaged and zero copy memory function CUDA Programming and Performance	1	5766	March 1, 2018
CUDA Zero Copy On TX1 Jetson TX1	20	6827	October 18, 2021
"What is the best practice for CUDA data transfer (CUDA 12.5)?" CUDA Programming and Performance	4	111	September 2, 2024
does anybody have experience on cudaHostRegister zero copy memory CUDA Programming and Performance	8	14448	May 21, 2011
gpu access host memory CUDA Programming and Performance	1	638	January 20, 2012
cudaHostGetDevicePointer() and Zero-Copy CUDA Programming and Performance	7	9926	May 1, 2009
Zero-copy: is cudaSetDeviceFlags(cudaDeviceMapHost) actually needed? CUDA Programming and Performance cuda	5	293	February 24, 2025

cudaHostAllocWriteCombined vs cudaHostAllocMapped

Related topics