Hallo CUDA hackers,
I have a query on CUDA memory consistency model across kernel invocations
w.r.t. the GPU-GPU synchronization. It may be a stupid question on my part, but it seems
that device global memory consistency/coherence is not always ensured, despite the docs
claiming that there are no caches for device global memory.
I am using the CUDA 2.0 on Linux (8800GTS 512 as Device#1 with no monitor attached,
Device#0 does have a monitor) with C using the low-level CUDA Driver API:
The setup is as follows:
- Wait_kernel:
__global__ void wait_kernel(const volatile int buf[1],int out[1]) {
while (!buf[0]);
out[0]=buf[0]+1;
}
- Trigger_kernel:
__global__ void trigger_kernel(volatile int buf[1]) {
buf[0]=125;
}
-
main.c, which after the usual DriverAPI stuff, does roughly this (“w” is kernel for wait_kernel
and “t” is kernel for trigger_kernel, “b” and “o” are pointers to 1 integer in page-locked
memory, “buf” and “out” are devicepointers to 1 integer):
StreamCreate(&wait,0);StreamCreate(&trig,0);StreamCreate(&root,0);
/* start wait...*/
launch(wait, &w, buf,out);
for (int i=0; i<1000; i++) MemcpyDtoHAsync(o,out,sizeof(int),root);
/* start trigger...*/
launch(trig, &t, buf);
for (i=0; i<1000 && !o[0]; i++) MemcpyDtoHAsync(o,out,sizeof(int),root);
/* make it stop */ b[0]=100; MemcpyHtoDAsync(buf,b,sizeof(int),root);
StreamSynchronize(root);
I expected this code to result in o[0]==126, since the trigger_kernel is
supposed to run asynchronously to the wait_kernel and to the host code. However,
it doesn’t result in expected behaviour - without “make it stop” line, the host thread
and the wait_kernel are dead-locked. With “make it stop” line, the result is o[0]==0
and out[0]==“101” (as expected).
Putting StreamSynchronize(trig) (or anything else that blocks on trig stream) after
trigger_kernel launch anywhere results in similar dead-lock.
When I replaced the trigger_kernel launch by host code as follows:
b[0]=120; MemcpyHtoDAsync(buf,b,sizeof(int),root);
which does result in correct o[0]==121 (*).
Also, launching the trigger_kernel before wait_kernel does result in correct behaviour (**).
From all this I infer:
-
synchronization with the host via global memory is possible (*)
-
device memory sharing between kernels running in different streams in one application is
possible (**)
Given this, I expected that it would also work with the synchronization flag (“buf”)
allocated/initialized/read on the device global memory, but it smells like there is something fishy there…
Anybody else with similar problems? Or am I doing something really stupid?
P.S. I know the docs say that there is no synchronization mechanism between blocks (other than
device global memory). This does mean that I could always use device global memory for
inter-block/kernel synchronization, no?