I have a multi-stage process where data comes in and goes through 5 different kernels before generating a result. Most of the examples I have seen deal only with using one kernel at a time, but in this case I need parallel kernels. Basically, once the data is finished with kernel 1, it moves on to kernel 2 and the next dataset moves into kernel 1.
1 → 2 → 3 → 4 → 5
Ideally, all 5 kernels would run at the same time and then some sort of synchronization would happen at the end of the “cycle” to move the data to the next kernel.
Sorry Jim, you can only synchronise kernels on the host. CUDA only runs one kernel at a time and there is no sync between blocks. I get the impression that this is not likely to change soon. So you need to try to get as much as you can into a single kernel.
Cheers, Eric
Note that you can keep the interim results in GPU main memory and just refer to them from the second and subsequent kernels. I do this all the time. Just use the same device pointers in the host code.
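For example, something like this (a minimal sketch - the kernel names, sizes and the per-stage work are placeholders, not anyone's actual code):

```cuda
#include <cuda_runtime.h>

// Two placeholder stages; the point is only that d_mid never leaves the device.
__global__ void stage1(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;           // dummy work
}

__global__ void stage2(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;           // dummy work
}

int main()
{
    const int n = 1024;
    float h_in[1024] = {0}, h_out[1024];
    float *d_in, *d_mid, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_mid, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    stage1<<<n / 256, 256>>>(d_in, d_mid);   // interim result stays in d_mid
    stage2<<<n / 256, 256>>>(d_mid, d_out);  // next kernel just reuses the pointer

    // Only the final result crosses the bus; this copy also synchronizes.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_mid); cudaFree(d_out);
    return 0;
}
```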
Soon I will be facing the same issue of executing kernels in parallel.
As far as I know - I read it here somewhere - a kernel call is blocking. So I think it will not be possible to run kernels in parallel unless there are multiple cards in the system.
To check my understanding: imagine I’ve got 16 kernels which each use only one multiprocessor, and every kernel needs 1 ms to execute. If I could get all kernels to execute in parallel, I would have a delay of 15 ms for the first result, but then a further result every millisecond.
Actually, this is the common reason for pipelining: I want to get my results faster (higher throughput), and the delay before the first result doesn’t matter at all.
Do you think I can achieve the same performance if I’ve got 16 kernels which are executed sequentially, but every kernel can use all multiprocessors?
If each of the 16 kernels is not too complicated, I would fuse some or all of them into one bigger kernel with an initial switch statement. Then execute the big kernel to run all pipeline stages at once, and change the data pointers on the host after each call so that the data “moves through the pipeline”. To avoid issues with “empty” pipeline slots, just discard the first 16 results the pipeline produces.
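Roughly like this (an untested sketch for 5 stages - the stage work, sizes and names are made up; cudaThreadSynchronize() is the old runtime-API call). Note that each stage needs its own output buffer, since all stages execute concurrently within one launch:

```cuda
#include <cuda_runtime.h>

#define N          256   // elements per dataset; one block per stage
#define NUM_STAGES 5

// One fused kernel: blockIdx.x selects the pipeline stage, so every warp
// in a block takes the same switch branch and nothing diverges.
__global__ void pipelineKernel(const float *in0, float *out0,
                               const float *in1, float *out1,
                               const float *in2, float *out2,
                               const float *in3, float *out3,
                               const float *in4, float *out4)
{
    const float *in;
    float *out;
    switch (blockIdx.x) {
    case 0:  in = in0; out = out0; break;
    case 1:  in = in1; out = out1; break;
    case 2:  in = in2; out = out2; break;
    case 3:  in = in3; out = out3; break;
    default: in = in4; out = out4; break;
    }
    out[threadIdx.x] = in[threadIdx.x] + 1.0f;   // placeholder stage work
}

int main()
{
    float *in[NUM_STAGES], *out[NUM_STAGES];
    for (int k = 0; k < NUM_STAGES; ++k) {
        cudaMalloc((void **)&in[k],  N * sizeof(float));
        cudaMalloc((void **)&out[k], N * sizeof(float));
    }

    for (int iter = 0; iter < 100; ++iter) {
        // ... copy the next input dataset into in[0] here ...
        pipelineKernel<<<NUM_STAGES, N>>>(in[0], out[0], in[1], out[1],
                                          in[2], out[2], in[3], out[3],
                                          in[4], out[4]);
        cudaThreadSynchronize();   // host-side sync between pipeline steps

        // Once the pipeline has filled (iter >= NUM_STAGES - 1),
        // out[NUM_STAGES - 1] holds a finished result: copy it back here
        // and simply discard the earlier "empty" results.

        // Advance the pipeline by pointer swaps: stage k's output becomes
        // stage k+1's input, and the freed buffer is recycled as output.
        for (int k = 1; k < NUM_STAGES; ++k) {
            float *t = in[k]; in[k] = out[k - 1]; out[k - 1] = t;
        }
    }

    for (int k = 0; k < NUM_STAGES; ++k) {
        cudaFree(in[k]);
        cudaFree(out[k]);
    }
    return 0;
}
```

The pointer swap at the end is all it takes to move the data one stage further; nothing is copied on the device.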
I read in the programming guide that it is a very bad idea to split the code into multiple program flows - e.g. with if/then/else or something like that - because all processors of a multiprocessor always execute the same instruction, and some of them would stall.
The switch would be at the very beginning of the code and of course should depend on blockIdx. So all threads in the block and therefore all warps will take the same path. No divergence possible.
To answer your first question, you can run different code in a given threadblock (i.e. on a multiprocessor) without a performance penalty, as long as no warp contains threads running different code. I believe that’s what prkipfer was saying above. So, for the 16 kernels you’d have to make sure that each kernel is executed by a multiple of warp_size (32 for G80) threads. You’d have to use __syncthreads() to ensure that each “pipeline stage” executes on time.
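For illustration, a sketch of such a block-internal pipeline (names and the dummy stage work are invented): two stages, one warp each, double-buffered through shared memory so one stage’s writes can’t race the other’s reads, with __syncthreads() acting as the pipeline clock.

```cuda
#define WARP_SIZE 32

// One block, two warps, two "stages": whole warps branch together, so
// there is no intra-warp divergence. Stage 0 produces into this step's
// half of the double buffer while stage 1 consumes the previous step's
// half; __syncthreads() advances all warps one step at a time.
__global__ void blockPipeline(const float *in, float *out, int steps)
{
    __shared__ float buf[2][WARP_SIZE];   // double buffer between stages

    int warp = threadIdx.x / WARP_SIZE;   // which stage this thread runs
    int lane = threadIdx.x % WARP_SIZE;

    for (int t = 0; t < steps; ++t) {
        if (warp == 0)                    // stage 0: dummy transform
            buf[t & 1][lane] = in[t * WARP_SIZE + lane] * 2.0f;
        else if (t > 0)                   // stage 1: consume last step's data
            out[(t - 1) * WARP_SIZE + lane] = buf[(t - 1) & 1][lane] + 1.0f;
        __syncthreads();                  // the pipeline "clock"
    }
    if (warp == 1 && steps > 0)           // drain the final pipeline slot
        out[(steps - 1) * WARP_SIZE + lane] = buf[(steps - 1) & 1][lane] + 1.0f;
}

// launched as, e.g.: blockPipeline<<<1, 2 * WARP_SIZE>>>(d_in, d_out, steps);
// where d_in and d_out each hold steps * WARP_SIZE floats
```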
Now, for your second question. That’s a bit tougher, since you need a synchronization mechanism to sequence the 16 stages. The problem is that there isn’t a nice way to sync across threadblocks. I’m currently working on a sync-type function that blocks across threadblocks, which works with mixed success - it is reliable as long as the number of threadblocks is less than or equal to the number of multiprocessors (and the card is dedicated to CUDA work), but issues pop up otherwise. You may not need it though, as in your case there are two possibilities:
1. The different “pipeline stages” in the same threadblock process the same data. This means that there is no need to transfer data between threadblocks, so there’s no need to synchronize across threadblocks.
2. The different “pipeline stages” in the same threadblock process data that comes from other threadblocks. In this case you need to write and then read the data through global memory, which would also require a sync across threadblocks. But since you communicate through global memory anyway, you might as well put the different “pipeline stages” into their own kernels and invoke them sequentially from the host.