__threadfence() and page-locked host memory

Hello everyone,

I understand that cuda memory fence function, __threadfence(), can enforce all threads wait until all global and shared memory accesses prior to it are visible to all threads in the device. My question is, for a block of page-lock host memory that has been mapped into device’s memory space, will this fence function work as well (since I don’t know whether this kind of memory should be classified as “global memory” or not)?

Your comments are highly appreciated


Another idea is that, for a block of mapped memory, should I explicitly use streams and cudaEventRecord(), in the kernel, and cudaEventQuery()/cudaEventSynchronize() to avoid all RAW, WAR, WAW hazards?

Are you trying to read memory written across PCIe from the same kernel or from a different kernel?

What I have in mind is that I have a block of mapped memory, such that both the host and the GPU can access it by pHost, and pDevice, individually. The memory block is set to all 0’s.

I have a kernel running at GPU with the following code segment (assuming there is only one block and one thread per block).

pDevice[0] = 1;


pDevice[1] = 1;

Can this guarantee that the host will always see pDevice[0] set before pDevice[1]?

The programming guild does not explicitly specify whether this fence function works for mapped memory blocks.

No, it does not because you’re at the mercy of a lot of different things like the PCIe controller that may coalesce writes, caches, etc. I don’t know of any way to make this work robustly at the moment, but if you want to experiment, go for it…

Many thanks for the prompt response.

Since the above idea is not guaranteed, how about this one?

Having pDevice residing at the global memory of the GPU, and a single kernel running with the following code segment (again, assuming there is only one block and one thread per block).

pDevice[0] = 1;


pDevice[1] = 1;

Can the host guarantee to be able to see pDevice[0] set before pDevice[1] (with cudaMemcpy() or cudaMemcpyAsync())?

Er, you’re going to need to give some example of how you’re planning to do this. How is the host going to see anything at that point?

OK. Here is a small example.

at Host


bool done = false;

cudaMalloc(&d_a, sizeof(int));

cudaMemset(d_a, 0, sizeof(int));

cudaMalloc(&d_b, sizeof(int));

cudaMemset(d_b, 0, sizeof(int));

myKernel<<<grid, block, 0 , 0>>>(d_a, d_b);

do {

  cudaMemcpyAsync(&b, &d_b, sizeof(int), cudaMemcpyDeviceToHost, 0);

  cudaEventRecord(event1, 0);

  while (cudaEventQuery(event1) == cudaErrorNotReady);

  if (b == 1)

	done = true;

} while (!done)

cudaMemcpyAsync(&a, &d_a, sizeof(int), cudaMemcpyDeviceToHost, 0); 

cudaEventRecord(event2, 0);

while (cudaEventQuery(event2) == cudaErrorNotReady)

x = a;  /* Q: can a here guarantee to be "1", always? */


at GPU


d_a = 1;


d_b = 1;


I’m going to go with no for a variety of reasons (async operations are undefined when they access memory currently being accessed by a kernel so whatever I say could change in the future, __threadfence only affects one path to memory, I don’t know if PCIe takes the same path).

Again, thanks for your comments.

Actually I doubt if __threadfence() works this way as well. If one day it really does, it would be very expensive since it has to monitor/control all paths to the memory.