Keeping data between kernels calls


I have a quite long kernel that I want to optimize. My idea is to split this single kernel into multiple smaller ones that can be better optimized. The problem I see is now that the result of one kernel is the input data of the next. So far the only solution I found to keep the data between different kernel calls is the use of global memory. If I unterstand correctly the lifetime of shared memory does not allow the use for different kernels - so that is unfortunately not an option.

Is there any other way to keep the data?


Better optimized in which sense? What factors will be driving the performance improvements that are being envisioned?

Keeping data resident on the GPU and minimizing data transfers between host and device is the correct approach.


thanks for your fast reply. Some part of the kernel can make use of a whole warp to solve a single problem, other parts may use just a single thread for a fast calculation. This is alternating through the kernel. So I want to separate the parts that can make use of more than a warp to solve a problem from the other that needs just a single thread.

Hope I could make my point clear?

I would hope that the kernel launch comprises more than one warp, because that would still be very inefficient. As for the single-threaded portions: I can see how this may be necessary if there is simply no inherent parallelism available, but I am not sure what kind of situation “just a single thread for a fast calculation” refers to.

Generally speaking, the kind of situation you describe, with varying degrees of parallelism available throughout a kernel, is not uncommon and may well be the highest-performing option. A thorough design process would prototype both this design approach as well as the split-kernel approach to assess which one is faster for this particular use case.

The other aspect that is likely worth investigating is whether there are algorithmic changes that could be used to increase the amount of available parallelism, so as to avoid single-threaded computation. Alternatively or additionally, since single-threaded execution is ideally suited to the CPU, one might alternatively look into how these portions of the computation could be moved to the host, maybe in the form of pre-computation.

A thorough search of the literature might turn up interesting ideas. How to best map work to execution resources on massively parallel computational platforms is still an area of active research. Sometimes a switch from traditional data structures can be the key to better parallelization. By now the applicability of GPUs to just about any kind of computational task has been looked into, so a literature search is highly likely to yield some useful information.

The process I like to use for best results is to first spend some quality time thinking up a design and prototyping it (usually with variants) by myself. I put serious effort into this. In a second stage I then search the literature extensively, going back multiple decades, looking for anything even remotely applicable.

Having made a serious first effort by myself usually provides me with enough insight that I can make a reasonable assessment whether some idea from the literature is (1) roughly along my own line(s) of though, (2) inferior to my own ideas (3) superior (at least in parts) to my best efforts. It is fairly rare (but happens), that I realize that an approach from group (3) is so vastly superior to my own ideas that I adopt it wholesale. Most of the time my final design is the synthesis of my original ideas and ideas from the literature: the “stand on the shoulder of giants” approach.

Thanks again for your detailed answer. Yes, I agree that prototype is one way to go (already started).

Just to clarify my approach since I think I didn’t explain very well what I try to do.

With ‘single threaded’ I do not mean that I start a kernel with a single grid and 32 threads. What I mean is that each thread gives me a single solution. I’m also happy with the performance of these parts.

In other parts of the kernel I can use 4 thread to improve performance, also giving me a single solution, and in other parts I can make use of a complete warp.

If all of these parts work in a single kernel I need to configure the kernel in that way that the part that needs most of the threads determine how many solution my kernel can calculate.


  1. part 1 thread → one solution
  2. part 4 threads → one solution
    3 part: 32 thread → one solution

If I start now a kernel like this:
grid <64, 1, 1> and block <128, 1, 1>

I use 64 * 128 threads that gave me 256 solutions since the last part needs 32 threads for a single result. And on the other side I have most of the GPU doing nothing for part 1 and 2.

My idea for splitting the kernel is now to run part 1 in its own kernel with a good occupancy and store all results in memory. The run part 2, reading the results from part 1 and continue calculation also in configuration that its kernel make good use of the gpu - then finnaly run the last part - again with its own kernel configuration.

I hope to get better performance because part 1 and 2 can use their own configuration resulting in a better occupancy. I can also use the unused MC for something else in a different stream (if the used resources will allow this). But this something for a later development if splitting the kernel gives the expected results.

Thanks for your help and insights.

Hi TrailingStop,
Variant 1:
it can be done by e.g. using a single kernel and have each warp calculate 32 solutions. In your example: For the first part of the kernel, each thread of the warp calculates one solution; for the second part, you have a for loop from 0 to 3 and within the loop body 4 threads each cooperate for 8 independent calculations concurrently times 4 loop iterations; for the third part you have a for loop from 0 to 31 with only one solution calculated for each iteration.

The data can be exchanged by warp shuffle or by shared memory.

Variant 2:
A different approach would be to have whole warps doing nothing instead of some threads of a warp. Keeping warps idle (e.g. by blocking for a _syncthreads()) is free, because the blocked warps are not scheduled and do not use up compute resources, except that the effective occupancy lowers.

Your example would be a (too) extreme case. But basically it would go: Start 32x32 thread blocks. For part 1 one warp is active; for part 2 fours warps are active; for part 3 thirty-two warps are active.

The data would be exchanged by shared memory.

The thread indices of threads within a block can have different meanings throughout your kernel. You are not bound to map them 1:1 to your problem space indices. Just see the threads as numbered resources. Between each _syncwarp() (or even _syncthreads()) you can redefine, how the work is distributed onto threads, and you can add any for loop within the kernel on top of it.

Hi Curefab,

thanks a lot for this information. I will give them a try and check which one works best in my case.