Overlapping global memory output between kernels

I have two kernels like the following (pseudocode):

__global__ void kernel1(const int* in, int* out_a, int* out_b) {
    out_a[threadIdx.x] = f(in[threadIdx.x]);  // writes buffer A
    out_b[threadIdx.x] = g(in[threadIdx.x]);  // writes buffer B
}

__global__ void kernel2(const int* in, int* out) {
    out[threadIdx.x] = h(in[threadIdx.x]);
}

int main() {
    kernel1<<<...>>>(input, buffer_a, buffer_b);
    kernel2<<<...>>>(buffer_a, buffer_c);  // reads only buffer A
}

So kernel1 and kernel2 are enqueued to execute one after the other. kernel1 performs a lot of global memory writes into its two output buffers A and B, which can cause a lot of latency.

But kernel2 does not need the output in buffer B. So performance could be gained if kernel2 started executing once kernel1 has completed the writes into buffer A, while kernel1 finishes the writes into buffer B at the same time.

Does CUDA do this if kernels are launched as in this code? Or could performance be gained if, for example, two streams were used?

Also, if in a case like this buffers A and B are combined into a single buffer (so that, from the point of view of the kernel arguments, it is not visible that kernel2 needs only part of kernel1’s output), is there a performance loss?

CUDA doesn’t do that automatically. The only way to overlap operations occurring on the GPU is through streams. Having said that, I’m not sure it would be possible to use streams specifically to get one kernel to start halfway through the execution of another kernel.
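What streams do give you is concurrency between independent launches. Here is a minimal sketch of that (the kernel names, sizes, and launch configuration are made up for illustration, not taken from your code): kernels issued to the same stream serialize, while kernels issued to different non-default streams may overlap if resources allow.

#include <cuda_runtime.h>

// Two independent toy kernels (names are illustrative).
__global__ void workA(int* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2;
}

__global__ void workB(int* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *a, *b;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // In one stream these would serialize; in separate streams the runtime
    // is free to overlap them if the GPU has spare resources.
    workA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    workB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b);
    return 0;
}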

If you wanted to arrange that, it would be better to partition the work in kernel1 into two separate kernels: one that produces buffer A (the only output kernel2 needs), and one that produces buffer B. It would then be a trivial matter to allow kernel2 to start after the completion of the buffer-A kernel, while the buffer-B kernel is still running.
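A minimal sketch of that partitioning, assuming f() and g() are independent; the kernel1_f/kernel1_g names, the tiny launch configuration, and the event-based dependency are my own illustration, not anything CUDA does for you:

#include <cuda_runtime.h>

// Illustrative stand-ins for f, g, h.
__device__ int f(int x) { return x * 2; }
__device__ int g(int x) { return x + 1; }
__device__ int h(int x) { return x - 3; }

// kernel1 split so the part kernel2 depends on (buffer A) stands alone.
__global__ void kernel1_f(const int* in, int* out_a) {
    out_a[threadIdx.x] = f(in[threadIdx.x]);
}

__global__ void kernel1_g(const int* in, int* out_b) {
    out_b[threadIdx.x] = g(in[threadIdx.x]);
}

__global__ void kernel2(const int* in, int* out) {
    out[threadIdx.x] = h(in[threadIdx.x]);
}

int main() {
    const int n = 256;
    int *input, *buffer_a, *buffer_b, *buffer_c;
    cudaMalloc(&input, n * sizeof(int));
    cudaMalloc(&buffer_a, n * sizeof(int));
    cudaMalloc(&buffer_b, n * sizeof(int));
    cudaMalloc(&buffer_c, n * sizeof(int));

    cudaStream_t s1, s2;
    cudaEvent_t a_done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&a_done);

    kernel1_f<<<1, n, 0, s1>>>(input, buffer_a);
    cudaEventRecord(a_done, s1);              // buffer A is complete here
    kernel1_g<<<1, n, 0, s1>>>(input, buffer_b);

    cudaStreamWaitEvent(s2, a_done, 0);       // kernel2 waits only for buffer A
    kernel2<<<1, n, 0, s2>>>(buffer_a, buffer_c);  // may overlap kernel1_g

    cudaDeviceSynchronize();
    return 0;
}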

Whether or not any of this results in a performance benefit depends on your code. Generally speaking, if your kernels are large enough to fully utilize GPU resources, then overlap of kernel execution may not be possible and may not lead to any performance benefit.

If f() and g() are independent, splitting kernel1 and running the g() half (kernel1_g) first would also allow the transfer of out_b to the host, if you need one, to overlap with the remaining kernel execution.
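A sketch of that, reusing the split kernels, streams, and sizes from above, and assuming out_b is copied back to pinned host memory (your code doesn’t show a host transfer, so this part is hypothetical):

// h_out_b must be pinned (cudaMallocHost) for the copy to be truly asynchronous.
int* h_out_b;
cudaMallocHost(&h_out_b, n * sizeof(int));

kernel1_g<<<1, n, 0, s1>>>(input, buffer_b);   // produce out_b first
cudaMemcpyAsync(h_out_b, buffer_b, n * sizeof(int),
                cudaMemcpyDeviceToHost, s1);   // copy overlaps with s2's work

kernel1_f<<<1, n, 0, s2>>>(input, buffer_a);
kernel2<<<1, n, 0, s2>>>(buffer_a, buffer_c);  // no dependency on the copy

cudaDeviceSynchronize();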

Again, whether or not that makes any significant difference depends on a lot of unprovided details; you’d need to do some experimentation and profiling.