Is it recommended to throw multiple kernels at once?


Is it recommended to launch multiple kernels at once? I mean, let’s suppose we have some functions like:

__global__ void cusum(int* a, int* b){
  //performs some validations, then does a += b
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  deviceadd(a[i], b[i]);
}

__global__ void cusub(int* a, int* b){
  //performs some validations, then does a -= b
}

I know these are really simple functions; they are just for use as an example.
So, let’s suppose we want to calculate (a + b) * (c - d); we can just run two independent kernels to do the additions and then multiply after a sync:

int main(){
  //a += b
  cusum<<<x, y>>>(a, b);

  //c -= d
  cusub<<<x, y>>>(c, d);

  //sync both kernels
  cudaDeviceSynchronize();

  //a *= c
  cumul<<<x, y>>>(a, c);

  //sync cumul kernel
  cudaDeviceSynchronize();
}

But it can also be done as one big function:

int main(){
  cuCalculate<<<x, y>>>(a, b, c, d);

  //sync cuCalculate
  cudaDeviceSynchronize();
}

__global__ void cuCalculate(int* a, int* b, int* c, int* d){
  //does all the work of the three kernels above in one pass
}
So, with the first method I understand it can work faster, as both additions are done in parallel, but I’m not sure it won’t cause some slowdown, as there will (maybe) be too many operations running simultaneously.

I would like to understand the main pros and cons of both approaches (am I right to assume the first one will need a more powerful GPU to work well?).


The second method is generally preferred. There are a variety of reasons for this.

  1. Every kernel launch has overhead (a time cost). Fewer kernel launches means less overhead.
  2. You generally want to do as much work per kernel launch as possible. This is related to items 1 and 3; in addition, if doing more work means you can launch more threads, you have a better opportunity to saturate the machine and give the GPU the best chance for latency hiding.
  3. The idea of kernel fusion. By combining operations that are working on the same data, we may be able to reduce loads/stores to global memory, which can have a significant impact on a memory-bound code.
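To make item 3 concrete, here is a minimal sketch of a fused kernel for the example above (following the code comments: a += b, c -= d, then a *= c). The kernel name and the bounds parameter n are my own assumptions, not part of the original code. The unfused version writes a and c to global memory and then re-reads them; the fused version touches each array only once.

```cuda
// Sketch only. Unfused: cusum reads a,b and writes a; cusub reads c,d
// and writes c; cumul then re-reads a and c and writes a again.
// Fused: each input is read once and the result is written once,
// eliminating the intermediate global-memory traffic on a and c.
__global__ void fusedCalc(int* a, const int* b, const int* c,
                          const int* d, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = (a[i] + b[i]) * (c[i] - d[i]);
}
```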

The only thing I can think of to commend the first method is that your problem size may be so small that you cannot saturate the GPU with a single kernel launch. Even then, my preference would be to seek to expose more parallelism.

In any event, it should be easy enough to cook up an example for both cases, benchmark them, and draw your own conclusion. I generally advise against taking work that can be done in a single kernel and breaking it into 2 or more kernels. I’m sure exceptions and corner cases can be found.
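As a hedged starting point for such a benchmark, here is a sketch using CUDA events; the kernel, problem size, and launch configuration are placeholders standing in for whichever variant you want to time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for your own work.
__global__ void work(int* a, const int* b){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] += b[i];
}

int main(){
  const int n = 1 << 20;
  int *a, *b;
  cudaMalloc(&a, n * sizeof(int));
  cudaMalloc(&b, n * sizeof(int));

  // CUDA events give GPU-side timestamps, avoiding host-timer noise.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  work<<<n / 256, 256>>>(a, b);   // time whichever variant you want here
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("elapsed: %f ms\n", ms);

  cudaFree(a); cudaFree(b);
  return 0;
}
```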

Hi Robert:

I didn’t benchmark it because I have a really small GPU (a laptop GeForce 920M), and I thought the results might vary from a modern GPU.

I read the fusion article and it makes a lot of sense (in my previous scenario, a and c are loaded and stored twice).

I will do some testing using the Visual Profiler; thanks for your answer :-)


I just profiled a sample, and got some revelations!
I had a “light” kernel (basically a memory copy) and a “heavy” kernel (it calculates a = a * a * c on some BigInteger values).

I ran both kernels at the same time, and here are the results.

Test 1: I launch the heavy one, then the light one, then sync:

Test 2: I launch the light one, then the heavy one, then sync:

In both scenarios there is no work overlap; one kernel simply waits for the other to end. The details show there really is no overlap: the end time of the first kernel is the start time of the next one…

Could this be caused by my small GPU?

Kernels generally don’t run concurrently. If you want to see kernels execute concurrently, there are a number of requirements you will have to satisfy. This topic has been covered many times on the web. There is a concurrent kernels sample code you can study if you wish.
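One of those requirements is launching into distinct non-default streams; kernels launched into the same stream (including the default stream) always serialize. Here is a minimal sketch of that part (kernel names and sizes are my own illustration; even with streams, overlap only happens if the device supports concurrent kernels and the first kernel leaves resources free):

```cuda
#include <cuda_runtime.h>

__global__ void lightKernel(int* x){ x[threadIdx.x] += 1; }

__global__ void heavyKernel(int* y){
  for (int i = 0; i < 1000; ++i) y[threadIdx.x] *= 3;
}

int main(){
  int *x, *y;
  cudaMalloc(&x, 32 * sizeof(int));
  cudaMalloc(&y, 32 * sizeof(int));

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // Distinct non-default streams are necessary (but not sufficient)
  // for overlap: the kernels must also leave SM resources available.
  heavyKernel<<<1, 32, 0, s1>>>(y);
  lightKernel<<<1, 32, 0, s2>>>(x);

  cudaDeviceSynchronize();
  cudaStreamDestroy(s1); cudaStreamDestroy(s2);
  cudaFree(x); cudaFree(y);
  return 0;
}
```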

Thanks again, Robert. I thought it was as simple as just launching both kernels to have them run concurrently.

Now it makes even more sense to unify all the work in one big kernel!