GPU-CPU & GPU-GPU synchronization query on advanced CUDA features

Hallo CUDA hackers,

I have a query on CUDA memory consistency model across kernel invocations

w.r.t. the GPU-GPU synchronization. It may be a stupid question on my part, but it seems

that device global memory consistency/coherence is not always ensured, despite the docs

claiming that there are no caches for device global memory.

I am using the CUDA 2.0 on Linux (8800GTS 512 as Device#1 with no monitor attached,

Device#0 does have a monitor) with C using the low-level CUDA Driver API:

The setup is as follows:

  1. Wait_kernel:
__global__ void wait_kernel(const volatile int buf[1],int out[1])  {       

          while (!buf[0]);


  1. Trigger_kernel:
__global__ void trigger_kernel(volatile int buf[1])  {


  1. main.c, which after the usual DriverAPI stuff, does roughly this (“w” is kernel for wait_kernel

    and “t” is kernel for trigger_kernel, “b” and “o” are pointers to 1 integer in page-locked

    memory, “buf” and “out” are devicepointers to 1 integer):


/* start wait...*/

launch(wait, &w, buf,out);

for (int i=0; i<1000; i++) MemcpyDtoHAsync(o,out,sizeof(int),root);

/* start trigger...*/

launch(trig, &t, buf);

for (i=0; i<1000 && !o[0]; i++) MemcpyDtoHAsync(o,out,sizeof(int),root);

/* make it stop */ b[0]=100; MemcpyHtoDAsync(buf,b,sizeof(int),root);


I expected this code to result in o[0]==126, since the trigger_kernel is

supposed to run asynchronously to the wait_kernel and to the host code. However,

it doesn’t result in expected behaviour - without “make it stop” line, the host thread

and the wait_kernel are dead-locked. With “make it stop” line, the result is o[0]==0

and out[0]==“101” (as expected).

Putting StreamSynchronize(trig) (or anything else that blocks on trig stream) after

trigger_kernel launch anywhere results in similar dead-lock.

When I replaced the trigger_kernel launch by host code as follows:

b[0]=120; MemcpyHtoDAsync(buf,b,sizeof(int),root);

which does result in correct o[0]==121 (*).

Also, launching the trigger_kernel before wait_kernel does result in correct behaviour (**).

From all this I infer:

  1. synchronization with the host via global memory is possible (*)

  2. device memory sharing between kernels running in different streams in one application is

    possible (**)

Given this, I expected that it would also work with the synchronization flag (“buf”)

allocated/initialized/read on the device global memory, but it smells like there is something fishy there…

Anybody else with similar problems? Or am I doing something really stupid?

P.S. I know the docs say that there is no synchronization mechanism between blocks (other than

   device global memory). This does mean that I could always use device global memory for 

   inter-block/kernel synchronization, no?

I think this is the expected behavior. Kernels can’t run in parallel right now. So your “trigger kernel” would never get a chance to set buf[0] because “waiting kernel” never returns. However, the memory copy can be overlapped with kernel execution.

Mu-Chi Sung, thanks a lot for your reply, it does explain a lot…

I wish it was documented explicitly that task-level parallellism is not possible (documentation only says that CUDA supports data-level parallelism).

Is it planned to support TLP in the near future? Since I can always get away with having just one

super-kernel with the following code in it, I guess it should not be hard…

switch(blockIdx.x) {

case 0: kernel_one(); break;

case 1: kernel_two(); break;



It has only been stated that you should not depend on kernels not running in parallel in the future.

I also have a kernel doing a bit like what you are doing, it does something like this:

switch(blockIdx.x) {

case 0: ptr = input1;

case 1: ptr = input2;


do my calcs

switch(blockIdx.x) {

case 0: output1 = ptr;

case 1: output2 = ptr;



The above keeps the register count low. Your option of running 2 different kernels is maybe not the best idea, since you might get high register counts, and at least you have the worst-case register count. But if they are comparable in register counts, you can do it just fine.

Yes, but isn’t the compiler supposed to relocate data to global device memory if it doesn’t have enough registers? I do need to place stuff that is shared between such “kernels” running in different blocks (and thus on different multiprocessors) in such global device memory anyways, so that’s not a problem.

I can rely on “auto” specifier to hint to the compiler that the thing should be in a register, and use the
“volatile” specifier to force it to global device memory (when its not shared), can I?

Unfortunately, the programming guide only specifies the behavior of “volatile” for shared data…

kernels doing different things running on different multiprocessors cannot really communicate using device memory since the order of execution is undefined. Also some blocks might not get scheduled before other blocks have finished, so you might up with deadlocks.

I think you want to run kernels after each other, but maybe you can explain what you want to do?

I would like to know if TLP-like processing is possible. It is certainly possible to have an overlap of computation between the GPU and the CPU, but since the GPU itself contains 8 SIMD multiprocessors, I would like to utilize that not just for one kernel only, but rather for a collection of different “kernels”.

The docs say that device memory can be used for communication between blocks (presumably of the same kernel) - does this not mean that if I start different device functions from one kernel, then the device memory can support communication between them?

Suppose I start the “kernels” as device functions using the “super-kernel” trick from above like this:

__device__ void kernel0(...), kernel1(...) ..., kernel7(...);

__global__ void super_kernel(...) {

switch (blockIdx.x) {

case 0: kernel0(...); break;

case 1: kernel1(...); break;


case 7: kernel7(...); break;



super_kernel<<<blocks=8, threads=1>>>(...).

Does this not give me a way to start 8 different _device functions simultaneously?

What critera must be fulfilled such that these these functions are mapped on 8 different blocks that are all activated at the same time on 8 different multiprocessors?

There is just no way to have any influence on scheduling. device memory is a way to communicate between kernel invocations, I would not call it a way to communicate between different device functions running at the same time. The normal way to do what you want would be :






That way you will be sure that kernel1 is finished before kernel2 is started, etc. Also you can have much more threads working for kernel1 than in your scheme.

It might be that with some kind of wait-looping in your device functions and not starting more blocks than multiprocessors you could maybe make it work, but is surely will not scale and it will also not be very portable from one graphics-card to another.

I would not go this way, unless it is for the fun of doing it. If you want something you can use in the long run (and with high performance) this does not seem like a good way to do it, unless you have a very specific problem to solve.

I’m just hoping that starting kernels on different streams in the future would imply that kernels
run in parallel, just like they do now with memcpy’s. But for now I could live with such “super-kernels”, no problem there as long as the smaller kernels are mapped to different multiprocessors.

Whether this is more or less efficient than starting one kernel with huge number of threads would depend on the amount of data a kernel is supposed to process at a time. If that number is not sufficient to load the GPU to the max, then the only way to performance is to run multiple (mostly) independent kernels simultaneously, on different multiprocessors.

Additionally, I suppose that starting a kernel incurs some latency, so given that the kernel processes only a small amount of data, this could lead even to slow-down, rather than speedup…

The reason I want these small “kernels” to get all started and then enter “wait” state is to limit that latency, since then it only takes one PCI-e and one global memory transaction to “trigger” it back to life (in my test the iteration count for my code is always <=2) …

And yes, I do have a specific problem to solve;-)

Yes, I indeed does, and it is quite a pain for me, the “intuitive” way to implement my algorithm leads to kernels that run about 10 us, with a startup latency of 20 us that means no speedup.

I managed to get a speedup by lumping as many as possible in one kernel, but I am still looking for a way to do that that makes it not a maintenance nightmare (so far the best approach involves lots of ugly macros).

But I doubt that your approach even if it worked would really help much to reduce latency, since in my estimation copying from host to device memory has similar latencies as starting a kernel.

If fear that at least for now, programming CUDA usually involves using lots and lots of very clever tricks and using some very different algorithms than you originally intended to :-)

I haven’t investigated the issue in depth yet, but my expectation is that kernel startup involves copying of instructions to the device, allocation of local device resources (shared memory, registers) and finally some trigger that kicks it to life. What I would like is to pre-load it, and then have just the latency of this trigger… Have you tried profiling launch of a kernel which contains a lot (thousands) of instructions?

My idea was actually to have “kernels” running on different multiprocessors synchronize using device memory (so, GPU-GPU), which might be faster than MemcpyHtoD (CPU-GPU), since there at least one less bus is involved…

BTW, my approach of retagging kernels as device and having a switch of up to 8 cases on blockIdx seems to work, but to be completely safe I would like to know how CUDA allocates blocks to multiprocessors - is it just a simple round-robin? (processor=blockIdx % 8) Also, if two blocks are allocated to the same multiprocessor, will the one with the lower index run first?

The next interesting question is whether GPUs in a multi-card setup can directly communicate to each other (via PCI-x) without the host being involved in orchestration of it all;-) I guess in the newer multi-GPU cards both G98s are attached to the same global memory, so that should not be a problem (as long as address spaces can be related), but will this work in a multi-card case?

The multi-GPU 9800 GX2 is actually two different cards sharing one PCI-Express interface. It appears to the system as two distinct CUDA devices, each with 512 MB of memory. Global memory is not shared between the two halves of the card.

How blocks get scheduled is undefined as far as I know. not all blocks take the same amount of time, and when a block is finished, the next one is quickly started on that MP.

multi-GPU cards have each their own global memory, you also have to control each GPU from your host-code.

communication between devices has been said as something that might come, but it is not possible at this time.

About the startup overhead: wumpus has posted some interesting things that he found out in the past. You can probably find them with the search (or otherwise by google)