GPU-CPU & GPU-GPU synchronization query on advanced CUDA features

Hello CUDA hackers,

I have a question about the CUDA memory consistency model across kernel invocations, with respect to GPU-GPU synchronization. It may be a stupid question on my part, but it seems that device global memory consistency/coherence is not always ensured, despite the docs claiming that there are no caches for device global memory.

I am using CUDA 2.0 on Linux (an 8800 GTS 512 as Device #1 with no monitor attached; Device #0 does have a monitor), programming in C with the low-level CUDA Driver API.

The setup is as follows:

  1. wait_kernel:

__global__ void wait_kernel(const volatile int buf[1], int out[1]) {
    while (!buf[0]);        /* spin until the flag becomes non-zero */
    out[0] = buf[0] + 1;
}
  2. trigger_kernel:

__global__ void trigger_kernel(volatile int buf[1]) {
    buf[0] = 125;           /* raise the flag the waiting kernel spins on */
}
  3. main.c, which after the usual Driver API setup does roughly the following ("w" and "t" are the function handles for wait_kernel and trigger_kernel, "b" and "o" are pointers to one integer each in page-locked host memory, "buf" and "out" are device pointers to one integer each):

StreamCreate(&wait, 0); StreamCreate(&trig, 0); StreamCreate(&root, 0);

/* start wait_kernel... */
launch(wait, &w, buf, out);
for (int i = 0; i < 1000; i++)
    MemcpyDtoHAsync(o, out, sizeof(int), root);

/* start trigger_kernel... */
launch(trig, &t, buf);
for (int i = 0; i < 1000 && !o[0]; i++)
    MemcpyDtoHAsync(o, out, sizeof(int), root);

/* make it stop */
b[0] = 100;
MemcpyHtoDAsync(buf, b, sizeof(int), root);
StreamSynchronize(root);
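
(Spelled out, those abbreviations correspond roughly to the Driver API calls below. This is only a sketch: error checking is omitted, and the cuParamSet* offsets assume a 32-bit build where a CUdeviceptr fits into 4 bytes.)

CUstream wait, trig, root;
cuStreamCreate(&wait, 0); cuStreamCreate(&trig, 0); cuStreamCreate(&root, 0);

/* launch wait_kernel asynchronously in stream "wait" */
cuFuncSetBlockShape(w, 1, 1, 1);
cuParamSeti(w, 0, buf);                  /* buf at offset 0 (4-byte CUdeviceptr assumed) */
cuParamSeti(w, 4, out);                  /* out at offset 4 */
cuParamSetSize(w, 8);
cuLaunchGridAsync(w, 1, 1, wait);

for (int i = 0; i < 1000; i++)
    cuMemcpyDtoHAsync(o, out, sizeof(int), root);

/* launch trigger_kernel asynchronously in stream "trig" */
cuFuncSetBlockShape(t, 1, 1, 1);
cuParamSeti(t, 0, buf);
cuParamSetSize(t, 4);
cuLaunchGridAsync(t, 1, 1, trig);

for (int i = 0; i < 1000 && !o[0]; i++)
    cuMemcpyDtoHAsync(o, out, sizeof(int), root);

/* make it stop */
b[0] = 100;
cuMemcpyHtoDAsync(buf, b, sizeof(int), root);
cuStreamSynchronize(root);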

I expected this code to result in o[0]==126, since trigger_kernel is supposed to run asynchronously to wait_kernel and to the host code. However, it doesn't behave as expected: without the "make it stop" line, the host thread and wait_kernel deadlock. With the "make it stop" line, the result is o[0]==0 and out[0]==101 (the latter as expected, given that the host set buf[0] to 100).

Putting StreamSynchronize(trig) (or anything else that blocks on the trig stream) anywhere after the trigger_kernel launch results in a similar deadlock.

When I replace the trigger_kernel launch with host code as follows:

b[0] = 120; MemcpyHtoDAsync(buf, b, sizeof(int), root);

I do get the correct result o[0]==121 (*).

Also, launching the trigger_kernel before wait_kernel does result in correct behaviour (**).

From all this I infer:

  1. synchronization with the host via global memory is possible (*)

  2. device memory sharing between kernels running in different streams in one application is

    possible (**)

Given this, I expected that it would also work with the synchronization flag ("buf") allocated/initialized/read entirely in device global memory, but it smells like something fishy is going on there…

Anybody else with similar problems? Or am I doing something really stupid?

P.S. I know the docs say that there is no synchronization mechanism between blocks (other than device global memory). This does mean that I could always use device global memory for inter-block/kernel synchronization, no?

I think this is the expected behavior. Kernels can’t run in parallel right now. So your “trigger kernel” would never get a chance to set buf[0] because “waiting kernel” never returns. However, the memory copy can be overlapped with kernel execution.
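
For example, something like the sketch below is possible today: a kernel running in one stream while an async copy runs in another. The names and the kernel are made up for illustration, and it needs page-locked host memory plus a device that supports concurrent copy and execution.

#include <cuda_runtime.h>

/* placeholder kernel that just touches some data */
__global__ void busy_kernel(int *data, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] += 1;
}

int main(void) {
    const int n = 1 << 20;
    int *d_a, *d_b, *h_b;
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMalloc((void**)&d_b, n * sizeof(int));
    cudaMallocHost((void**)&h_b, n * sizeof(int));   /* page-locked host buffer */

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* the kernel in stream s0 and the copy in stream s1 may overlap */
    busy_kernel<<<1, 256, 0, s0>>>(d_a, n);
    cudaMemcpyAsync(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost, s1);

    cudaThreadSynchronize();   /* wait for both streams to finish */
    return 0;
}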

Mu-Chi Sung, thanks a lot for your reply, it does explain a lot…

I wish it were documented explicitly that task-level parallelism is not possible (the documentation only says that CUDA supports data-level parallelism).

Is it planned to support TLP in the near future? Since I can always get away with having just one

super-kernel with the following code in it, I guess it should not be hard…

switch (blockIdx.x) {
    case 0: kernel_one(); break;
    case 1: kernel_two(); break;
    ...
}

It has only been stated that you should not depend on kernels not running in parallel in the future.

I also have a kernel that does a bit of what you are doing; it looks something like this:

switch (blockIdx.x) {
    case 0: ptr = input1; break;
    case 1: ptr = input2; break;
    ...
}

/* do my calcs */

switch (blockIdx.x) {
    case 0: output1 = ptr; break;
    case 1: output2 = ptr; break;
    ...
}

The above keeps the register count low. Your option of running 2 different kernels inside one launch is maybe not the best idea, since you might get high register counts; the combined kernel always uses at least the worst-case register count of the two. But if they are comparable in register count, you can do it just fine.
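
Spelled out, the pattern is roughly the sketch below; the names and the calculation are placeholders, not my real kernel:

__global__ void multi_input_kernel(float *input1, float *input2,
                                   float *output1, float *output2, int n) {
    float *in, *out;

    /* pick this block's data set */
    switch (blockIdx.x) {
        case 0: in = input1; out = output1; break;
        case 1: in = input2; out = output2; break;
        default: return;
    }

    /* "do my calcs" -- placeholder computation */
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = 2.0f * in[i];
}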

Yes, but isn't the compiler supposed to relocate data to global device memory if it doesn't have enough registers? I need to place the stuff that is shared between such "kernels" running in different blocks (and thus on different multiprocessors) in global device memory anyway, so that's not a problem.

I can rely on the "auto" specifier to hint to the compiler that a variable should live in a register, and use the "volatile" specifier to force it into global device memory (when it's not shared), can't I?

Unfortunately, the programming guide only specifies the behavior of "volatile" for shared data…

Kernels doing different things on different multiprocessors cannot really communicate using device memory, since the order of execution is undefined. Also, some blocks might not get scheduled until other blocks have finished, so you might end up with deadlocks.

I think you want to run kernels one after another, but maybe you can explain what you want to do?

I would like to know if TLP-like processing is possible. It is certainly possible to overlap computation between the GPU and the CPU, but since the GPU itself contains 8 SIMD multiprocessors, I would like to utilize them not just for a single kernel, but for a collection of different "kernels".

The docs say that device memory can be used for communication between blocks (presumably of the same kernel). Does this not mean that if I start different device functions from one kernel, device memory can support communication between them?

Suppose I start the "kernels" as device functions using the "super-kernel" trick from above, like this:

__device__ void kernel0(...);
__device__ void kernel1(...);
...
__device__ void kernel7(...);

__global__ void super_kernel(...) {
    switch (blockIdx.x) {
        case 0: kernel0(...); break;
        case 1: kernel1(...); break;
        ...
        case 7: kernel7(...); break;
    }
}

super_kernel<<<8, 1>>>(...);    /* 8 blocks of 1 thread each */

Does this not give me a way to start 8 different __device__ functions simultaneously?

What criteria must be fulfilled such that these functions are mapped to 8 different blocks that are all active at the same time on 8 different multiprocessors?

There is just no way to have any influence on scheduling. Device memory is a way to communicate between kernel invocations; I would not call it a way to communicate between different device functions running at the same time. The normal way to do what you want would be:

kernel1<<<>>>();

kernel2<<<>>>();

kernel3<<<>>>();

kernel4<<<>>>();

kernel5<<<>>>();

That way you will be sure that kernel1 is finished before kernel2 is started, and so on. Also, you can have many more threads working on kernel1 than in your scheme.

It might be that with some kind of wait-looping in your device functions, and by not starting more blocks than there are multiprocessors, you could maybe make it work, but it surely will not scale and it will also not be very portable from one graphics card to another.
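
By wait-looping I mean something like the sketch below (made-up names; the flag is assumed to be zeroed in global memory before launch, and no more blocks are launched than there are multiprocessors). Nothing guarantees that all blocks are resident at the same time, or that the data write becomes visible before the flag write, which is exactly why this can deadlock or break:

/* producer block writes a result, then raises a flag */
__device__ void producer(volatile int *flag, int *data) {
    data[0] = 42;            /* produce something */
    flag[0] = 1;             /* raise the flag (no ordering guarantee!) */
}

/* consumer blocks spin on the flag, then read the result */
__device__ void consumer(volatile int *flag, int *data, int *out) {
    while (!flag[0]);        /* deadlocks if the producer block never runs */
    out[blockIdx.x] = data[0];
}

__global__ void super_kernel(volatile int *flag, int *data, int *out) {
    if (blockIdx.x == 0) producer(flag, data);
    else                 consumer(flag, data, out);
}

/* launched e.g. as super_kernel<<<8, 1>>>(flag, data, out); with flag[0] == 0 */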

I would not go this way, unless it is for the fun of doing it. If you want something you can use in the long run (and with high performance) this does not seem like a good way to do it, unless you have a very specific problem to solve.

I'm just hoping that, in the future, starting kernels in different streams will imply that those kernels run in parallel, just like kernels and memcpys already overlap now. But for now I could live with such "super-kernels"; no problem there, as long as the smaller kernels are mapped to different multiprocessors.

Whether this is more or less efficient than starting one kernel with a huge number of threads depends on the amount of data a kernel has to process at a time. If that amount is not sufficient to load the GPU to the max, then the only way to get performance is to run multiple (mostly) independent kernels simultaneously, on different multiprocessors.

Additionally, I suppose that starting a kernel incurs some latency, so if the kernel processes only a small amount of data, this could even lead to a slow-down rather than a speedup…

The reason I want these small "kernels" all started and then sitting in a "wait" state is to limit that latency: it then takes only one PCI-e transfer and one global memory transaction to "trigger" one back to life (in my tests the iteration count of the polling loop is always <=2)…

And yes, I do have a specific problem to solve;-)

Yes, it indeed does, and it is quite a pain for me: the "intuitive" way to implement my algorithm leads to kernels that run for about 10 us, and with a startup latency of 20 us that means no speedup.

I managed to get a speedup by lumping as many of them as possible into one kernel, but I am still looking for a way to do that without it becoming a maintenance nightmare (so far the best approach involves lots of ugly macros).

But I doubt that your approach, even if it worked, would really help much to reduce latency, since in my estimation copying from host to device memory has a latency similar to that of starting a kernel.

I fear that, at least for now, programming CUDA usually involves lots and lots of very clever tricks and some very different algorithms than you originally intended :-)

I haven't investigated the issue in depth yet, but my expectation is that kernel startup involves copying the instructions to the device, allocating per-launch device resources (shared memory, registers), and finally some trigger that kicks it to life. What I would like is to pre-load all of that and then pay only the latency of the trigger… Have you tried profiling the launch of a kernel that contains a lot (thousands) of instructions?
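
(The way I would measure that overhead is something like the sketch below: a host timer around an empty kernel, assuming cudaThreadSynchronize() after each launch is enough to include the whole round trip.)

#include <cuda_runtime.h>
#include <sys/time.h>
#include <stdio.h>

__global__ void empty_kernel(void) { }    /* does nothing: measures pure launch cost */

static double now_us(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void) {
    const int N = 1000;

    empty_kernel<<<1, 1>>>();             /* warm-up launch */
    cudaThreadSynchronize();

    double t0 = now_us();
    for (int i = 0; i < N; i++) {
        empty_kernel<<<1, 1>>>();
        cudaThreadSynchronize();          /* include launch + completion */
    }
    double t1 = now_us();

    printf("average launch overhead: %.1f us\n", (t1 - t0) / N);
    return 0;
}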

My idea was actually to have "kernels" running on different multiprocessors synchronize using device memory (so, GPU-GPU), which might be faster than MemcpyHtoD (CPU-GPU), since at least one less bus is involved…

BTW, my approach of re-tagging kernels as __device__ functions and switching on blockIdx with up to 8 cases seems to work, but to be completely safe I would like to know how CUDA allocates blocks to multiprocessors. Is it just a simple round-robin (processor = blockIdx % 8)? Also, if two blocks are allocated to the same multiprocessor, will the one with the lower index run first?
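
(One way to at least observe the mapping, if not control it, might be to have each block record which multiprocessor it ran on; a sketch that assumes inline PTX and the %smid special register are usable with this toolkit:)

/* each block writes the ID of the multiprocessor it ran on */
__global__ void where_am_i(unsigned int *sm_of_block) {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));   /* read the PTX %smid register */
    if (threadIdx.x == 0)
        sm_of_block[blockIdx.x] = smid;
}

/* launch e.g. where_am_i<<<8, 1>>>(d_sm_of_block); then copy the array back and inspect it */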

The next interesting question is whether GPUs in a multi-card setup can communicate with each other directly (over PCI-e) without the host being involved in orchestrating it all ;-) I guess in the newer multi-GPU cards both G98s are attached to the same global memory, so that should not be a problem (as long as the address spaces can be related), but will this work in a multi-card case?

The multi-GPU 9800 GX2 is actually two different cards sharing one PCI-Express interface. It appears to the system as two distinct CUDA devices, each with 512 MB of memory. Global memory is not shared between the two halves of the card.

How blocks get scheduled is undefined as far as I know. Not all blocks take the same amount of time, and when a block is finished, the next one is quickly started on that MP.

Multi-GPU cards each have their own global memory per GPU, and you also have to control each GPU separately from your host code.

Communication between devices has been mentioned as something that might come, but it is not possible at this time.

About the startup overhead: wumpus has posted some interesting things he found out in the past. You can probably find them with the forum search (or otherwise via Google).