__threadfence() and page-locked host memory

boricworld · June 23, 2009, 12:15am

Hello everyone,

I understand that cuda memory fence function, __threadfence(), can enforce all threads wait until all global and shared memory accesses prior to it are visible to all threads in the device. My question is, for a block of page-lock host memory that has been mapped into device’s memory space, will this fence function work as well (since I don’t know whether this kind of memory should be classified as “global memory” or not)?

Your comments are highly appreciated

B

boricworld · June 23, 2009, 5:10pm

Another idea is that, for a block of mapped memory, should I explicitly use streams and cudaEventRecord(), in the kernel, and cudaEventQuery()/cudaEventSynchronize() to avoid all RAW, WAR, WAW hazards?

tmurray · June 23, 2009, 8:03pm

Are you trying to read memory written across PCIe from the same kernel or from a different kernel?

boricworld · June 23, 2009, 9:06pm

What I have in mind is that I have a block of mapped memory, such that both the host and the GPU can access it by pHost, and pDevice, individually. The memory block is set to all 0’s.

I have a kernel running at GPU with the following code segment (assuming there is only one block and one thread per block).

pDevice[0] = 1;

__threadfence();

pDevice[1] = 1;

Can this guarantee that the host will always see pDevice[0] set before pDevice[1]?

The programming guild does not explicitly specify whether this fence function works for mapped memory blocks.

tmurray · June 23, 2009, 9:25pm

No, it does not because you’re at the mercy of a lot of different things like the PCIe controller that may coalesce writes, caches, etc. I don’t know of any way to make this work robustly at the moment, but if you want to experiment, go for it…

boricworld · June 23, 2009, 9:35pm

Many thanks for the prompt response.

Since the above idea is not guaranteed, how about this one?

Having pDevice residing at the global memory of the GPU, and a single kernel running with the following code segment (again, assuming there is only one block and one thread per block).

pDevice[0] = 1;

__threadfence();

pDevice[1] = 1;

Can the host guarantee to be able to see pDevice[0] set before pDevice[1] (with cudaMemcpy() or cudaMemcpyAsync())?

tmurray · June 23, 2009, 9:37pm

Er, you’re going to need to give some example of how you’re planning to do this. How is the host going to see anything at that point?

boricworld · June 23, 2009, 9:47pm

OK. Here is a small example.

at Host

...

bool done = false;

cudaMalloc(&d_a, sizeof(int));

cudaMemset(d_a, 0, sizeof(int));

cudaMalloc(&d_b, sizeof(int));

cudaMemset(d_b, 0, sizeof(int));

myKernel<<<grid, block, 0 , 0>>>(d_a, d_b);

do {

  cudaMemcpyAsync(&b, &d_b, sizeof(int), cudaMemcpyDeviceToHost, 0);

  cudaEventRecord(event1, 0);

  while (cudaEventQuery(event1) == cudaErrorNotReady);

  if (b == 1)

	done = true;

} while (!done)

cudaMemcpyAsync(&a, &d_a, sizeof(int), cudaMemcpyDeviceToHost, 0); 

cudaEventRecord(event2, 0);

while (cudaEventQuery(event2) == cudaErrorNotReady)

x = a;  /* Q: can a here guarantee to be "1", always? */

....

at GPU

...

d_a = 1;

__threadfence();

d_b = 1;

...

tmurray · June 23, 2009, 9:56pm

I’m going to go with no for a variety of reasons (async operations are undefined when they access memory currently being accessed by a kernel so whatever I say could change in the future, __threadfence only affects one path to memory, I don’t know if PCIe takes the same path).

boricworld · June 23, 2009, 10:04pm

Again, thanks for your comments.

Actually I doubt if __threadfence() works this way as well. If one day it really does, it would be very expensive since it has to monitor/control all paths to the memory.