I have a multi-stage process where data comes in and goes through 5 different kernels before generating a result. Most of the examples I have seen deal only with using one kernel at a time, but in this case I need parallel kernels. Basically, once the data is finished with kernel 1, it moves on to kernel 2 and the next dataset moves into kernel 1.
1 → 2 → 3 → 4 → 5
Ideally, all 5 kernels would run at the same time and then some sort of synchronization would happen at the end of the “cycle” to move the data to the next kernel.
Sorry Jim, you can only synchronise kernels on the host. CUDA only runs one kernel at a time and there is no sync between blocks. I get the impression that this is not likely to change soon. So you need to try to get as much as you can into a single kernel.
Cheers, Eric
Note that you can keep the interim results in GPU main memory and just refer to them from the second and subsequent kernels. I do this all the time. Just use the same device pointers in the host code.
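For example, something like this (a minimal sketch - the kernel names, sizes and the per-stage work are placeholders, not anyone's actual code):

```cuda
#include <cuda_runtime.h>

// Two placeholder stages; the point is only that d_mid never leaves the device.
__global__ void stage1(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;           // dummy work
}

__global__ void stage2(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;           // dummy work
}

int main()
{
    const int n = 1024;
    float h_in[1024] = {0}, h_out[1024];
    float *d_in, *d_mid, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_mid, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    stage1<<<n / 256, 256>>>(d_in, d_mid);   // interim result stays in d_mid
    stage2<<<n / 256, 256>>>(d_mid, d_out);  // next kernel just reuses the pointer

    // Only the final result crosses the bus; this copy also synchronizes.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_mid); cudaFree(d_out);
    return 0;
}
```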
Soon I will be facing the same issue of executing kernels in parallel.
As far as I know - I read it here somewhere - a kernel call is blocking. So I think it will not be possible to run kernels in parallel unless there are multiple cards in the system.
To check my understanding: imagine I’ve got 16 kernels which each use only one multiprocessor, and every kernel needs 1 ms to execute. If I could get all kernels to execute in parallel, I would have a delay of 15 ms for the first result, but then a further result every millisecond.
Actually, this is the common reason for pipelining: I want to get my results faster (higher throughput), and the delay before the first result doesn’t matter at all.
Do you think I can achieve the same performance if I’ve got 16 kernels which are executed sequentially, but every kernel can use all multiprocessors?
If each of the 16 kernels is not too complicated, I would fuse some or all of them into one bigger kernel with an initial switch statement. Then execute the big kernel to run all pipeline stages at once, and change the data pointers on the host after each call so that the data “moves through the pipeline”. To avoid issues with “empty” pipeline slots, just discard the first 16 results the pipeline produces.
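Roughly like this (an untested sketch for 5 stages - the stage work, sizes and names are made up; cudaThreadSynchronize() is the old runtime-API call). Note that each stage needs its own output buffer, since all stages execute concurrently within one launch:

```cuda
#include <cuda_runtime.h>

#define N          256   // elements per dataset; one block per stage
#define NUM_STAGES 5

// One fused kernel: blockIdx.x selects the pipeline stage, so every warp
// in a block takes the same switch branch and nothing diverges.
__global__ void pipelineKernel(const float *in0, float *out0,
                               const float *in1, float *out1,
                               const float *in2, float *out2,
                               const float *in3, float *out3,
                               const float *in4, float *out4)
{
    const float *in;
    float *out;
    switch (blockIdx.x) {
    case 0:  in = in0; out = out0; break;
    case 1:  in = in1; out = out1; break;
    case 2:  in = in2; out = out2; break;
    case 3:  in = in3; out = out3; break;
    default: in = in4; out = out4; break;
    }
    out[threadIdx.x] = in[threadIdx.x] + 1.0f;   // placeholder stage work
}

int main()
{
    float *in[NUM_STAGES], *out[NUM_STAGES];
    for (int k = 0; k < NUM_STAGES; ++k) {
        cudaMalloc((void **)&in[k],  N * sizeof(float));
        cudaMalloc((void **)&out[k], N * sizeof(float));
    }

    for (int iter = 0; iter < 100; ++iter) {
        // ... copy the next input dataset into in[0] here ...
        pipelineKernel<<<NUM_STAGES, N>>>(in[0], out[0], in[1], out[1],
                                          in[2], out[2], in[3], out[3],
                                          in[4], out[4]);
        cudaThreadSynchronize();   // host-side sync between pipeline steps

        // Once the pipeline has filled (iter >= NUM_STAGES - 1),
        // out[NUM_STAGES - 1] holds a finished result: copy it back here
        // and simply discard the earlier "empty" results.

        // Advance the pipeline by pointer swaps: stage k's output becomes
        // stage k+1's input, and the freed buffer is recycled as output.
        for (int k = 1; k < NUM_STAGES; ++k) {
            float *t = in[k]; in[k] = out[k - 1]; out[k - 1] = t;
        }
    }

    for (int k = 0; k < NUM_STAGES; ++k) {
        cudaFree(in[k]);
        cudaFree(out[k]);
    }
    return 0;
}
```

The pointer swap at the end is all it takes to move the data one stage further; nothing is copied on the device.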
I read in the programming guide that it is a very bad idea to split the code into multiple program flows - e.g. with if/then/else or something like that - because all processors of a multiprocessor always execute the same instruction, and some of them would stall.
The switch would be at the very beginning of the code and of course should depend on blockIdx. So all threads in the block and therefore all warps will take the same path. No divergence possible.
To answer your first question, you can run different code in a given threadblock (i.e. on a multiprocessor) without a performance penalty, as long as no warp contains threads running different code. I believe that’s what prkipfer was saying above. So, for the 16 kernels you’d have to make sure that each kernel is executed by a multiple of warp_size (32 for G80) threads. You’d have to use __syncthreads() to ensure that each “pipeline stage” executes on time.
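For illustration, a sketch of such a block-internal pipeline (names and the dummy stage work are invented): two stages, one warp each, double-buffered through shared memory so one stage’s writes can’t race the other’s reads, with __syncthreads() acting as the pipeline clock.

```cuda
#define WARP_SIZE 32

// One block, two warps, two "stages": whole warps branch together, so
// there is no intra-warp divergence. Stage 0 produces into this step's
// half of the double buffer while stage 1 consumes the previous step's
// half; __syncthreads() advances all warps one step at a time.
__global__ void blockPipeline(const float *in, float *out, int steps)
{
    __shared__ float buf[2][WARP_SIZE];   // double buffer between stages

    int warp = threadIdx.x / WARP_SIZE;   // which stage this thread runs
    int lane = threadIdx.x % WARP_SIZE;

    for (int t = 0; t < steps; ++t) {
        if (warp == 0)                    // stage 0: dummy transform
            buf[t & 1][lane] = in[t * WARP_SIZE + lane] * 2.0f;
        else if (t > 0)                   // stage 1: consume last step's data
            out[(t - 1) * WARP_SIZE + lane] = buf[(t - 1) & 1][lane] + 1.0f;
        __syncthreads();                  // the pipeline "clock"
    }
    if (warp == 1 && steps > 0)           // drain the final pipeline slot
        out[(steps - 1) * WARP_SIZE + lane] = buf[(steps - 1) & 1][lane] + 1.0f;
}

// launched as, e.g.: blockPipeline<<<1, 2 * WARP_SIZE>>>(d_in, d_out, steps);
// where d_in and d_out each hold steps * WARP_SIZE floats
```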
Now, for your second question. That’s a bit tougher, since you need a synchronization mechanism to sequence the 16 stages. The problem is that there isn’t a nice way to sync across threadblocks. I’m currently working on a sync-type function that blocks across threadblocks, which works with mixed success - it is reliable as long as the number of threadblocks is less than or equal to the number of multiprocessors (and the card is dedicated to CUDA work), but issues pop up otherwise. You may not need it though, as in your case there are two possibilities:
1. The different “pipeline stages” in the same threadblock process the same data. This means that there is no need to transfer data between threadblocks, so there’s no need to synchronize across threadblocks.
2. The different “pipeline stages” in the same threadblock process data that comes from other threadblocks. In this case you need to write and then read the data through global memory, which would also require a sync across threadblocks. But since you communicate through global memory anyway, you might as well put the different “pipeline stages” into their own kernels and invoke them sequentially from the host.