Overlapping global memory output between kernels

I have two kernels like the following (pseudocode):

__global__ void kernel1(const int* in, int* out_a, int* out_b) {
    out_a[threadIdx.x] = f(in[threadIdx.x]);  // writes buffer A
    out_b[threadIdx.x] = g(in[threadIdx.x]);  // writes buffer B
}

__global__ void kernel2(const int* in, int* out) {
    out[threadIdx.x] = h(in[threadIdx.x]);
}

int main() {
    kernel1<<<...>>>(input, buffer_a, buffer_b);
    kernel2<<<...>>>(buffer_a, buffer_c);  // reads only buffer A
}

So kernel1 and kernel2 are enqueued to execute one after the other. kernel1 performs a lot of global memory writes into its two output buffers A and B, which can cause a lot of latency.

But kernel2 does not need the output in buffer B. So performance could be gained if kernel2 started executing once kernel1 has completed the writes into buffer A, while kernel1 finishes the writes into buffer B at the same time.

Does CUDA do this if kernels are launched as in this code? Or could performance be gained if, for example, two streams were used?

Also, if in a case like this buffers A and B are combined into a single buffer (so that, from the point of view of the kernel arguments, it is not visible that kernel2 needs only part of kernel1’s output), is there a performance loss?

CUDA doesn’t do that automatically. The only way to overlap operations occurring on the GPU is through streams. Having said that, I’m not sure it would be possible to use streams specifically to get one kernel to start halfway through the execution of another kernel.
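What streams do give you is concurrency between independent launches. Here is a minimal sketch of that (the kernel names, sizes, and launch configuration are made up for illustration, not taken from your code): kernels issued to the same stream serialize, while kernels issued to different non-default streams may overlap if resources allow.

#include <cuda_runtime.h>

// Two independent toy kernels (names are illustrative).
__global__ void workA(int* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2;
}

__global__ void workB(int* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *a, *b;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // In one stream these would serialize; in separate streams the runtime
    // is free to overlap them if the GPU has spare resources.
    workA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    workB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b);
    return 0;
}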

If you wanted to arrange that, it would be better to partition the work in kernel1 into two separate kernels: one that produces buffer A (the only output kernel2 needs), and one that produces buffer B. It would then be a trivial matter to allow kernel2 to start after the completion of the buffer-A kernel, while the buffer-B kernel is still running.
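A minimal sketch of that partitioning, assuming f() and g() are independent; the kernel1_f/kernel1_g names, the tiny launch configuration, and the event-based dependency are my own illustration, not anything CUDA does for you:

#include <cuda_runtime.h>

// Illustrative stand-ins for f, g, h.
__device__ int f(int x) { return x * 2; }
__device__ int g(int x) { return x + 1; }
__device__ int h(int x) { return x - 3; }

// kernel1 split so the part kernel2 depends on (buffer A) stands alone.
__global__ void kernel1_f(const int* in, int* out_a) {
    out_a[threadIdx.x] = f(in[threadIdx.x]);
}

__global__ void kernel1_g(const int* in, int* out_b) {
    out_b[threadIdx.x] = g(in[threadIdx.x]);
}

__global__ void kernel2(const int* in, int* out) {
    out[threadIdx.x] = h(in[threadIdx.x]);
}

int main() {
    const int n = 256;
    int *input, *buffer_a, *buffer_b, *buffer_c;
    cudaMalloc(&input, n * sizeof(int));
    cudaMalloc(&buffer_a, n * sizeof(int));
    cudaMalloc(&buffer_b, n * sizeof(int));
    cudaMalloc(&buffer_c, n * sizeof(int));

    cudaStream_t s1, s2;
    cudaEvent_t a_done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&a_done);

    kernel1_f<<<1, n, 0, s1>>>(input, buffer_a);
    cudaEventRecord(a_done, s1);              // buffer A is complete here
    kernel1_g<<<1, n, 0, s1>>>(input, buffer_b);

    cudaStreamWaitEvent(s2, a_done, 0);       // kernel2 waits only for buffer A
    kernel2<<<1, n, 0, s2>>>(buffer_a, buffer_c);  // may overlap kernel1_g

    cudaDeviceSynchronize();
    return 0;
}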

Whether or not any of this results in a performance benefit depends on your code. Generally speaking, if your kernels are large enough to fully utilize GPU resources, then overlap of kernel execution may not be possible and may not lead to any performance benefit.

If f() and g() are independent, splitting kernel1 and running the g() half (kernel1_g) first would also allow the transfer of out_b to the host, if you need one, to overlap with the remaining kernel execution.
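A sketch of that, reusing the split kernels, streams, and sizes from above, and assuming out_b is copied back to pinned host memory (your code doesn’t show a host transfer, so this part is hypothetical):

// h_out_b must be pinned (cudaMallocHost) for the copy to be truly asynchronous.
int* h_out_b;
cudaMallocHost(&h_out_b, n * sizeof(int));

kernel1_g<<<1, n, 0, s1>>>(input, buffer_b);   // produce out_b first
cudaMemcpyAsync(h_out_b, buffer_b, n * sizeof(int),
                cudaMemcpyDeviceToHost, s1);   // copy overlaps with s2's work

kernel1_f<<<1, n, 0, s2>>>(input, buffer_a);
kernel2<<<1, n, 0, s2>>>(buffer_a, buffer_c);  // no dependency on the copy

cudaDeviceSynchronize();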

Again, whether or not that makes any significant difference depends on a lot of unprovided details; you’d need to do some experimentation and profiling.