Hi, I am evaluating CUDA as an alternative for an existing OpenCL project we develop at work. So far I have done a straightforward conversion of one of our OpenCL modules to CUDA, and as someone who is entirely new to CUDA I am looking for tips on how to improve the performance of the CUDA pipeline.
The pipeline looks like this:

[pipeline diagram: host-to-device memcpy → kernel 1 → kernels 2-7 (independent) → kernel 8]
Some points about the pipeline:
- The kernels in the center of the pipeline are independent of each other; however, since the host code is single-threaded, they are submitted sequentially
- The bulk of the memcpy work is done before kernel 1 is submitted
- Each kernel submission follows the same pattern: allocate device memory for that kernel, submit the kernel with its device buffers and parameters, then move on to the next kernel (a sketch follows this list)
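For concreteness, here is a minimal sketch of that per-kernel pattern; the kernel, names, and sizes are hypothetical stand-ins, since the real code is confidential:

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the pipeline kernels
__global__ void kernel1(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i]; // placeholder work
}

void submitStep(const float* d_in, int n) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));             // allocate for this kernel
    kernel1<<<(n + 255) / 256, 256>>>(d_in, d_out, n); // submit with buffers/params
    // ...then repeat the same allocate-and-launch steps for the next kernel
}
```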
What recommendations would you make for someone who is new to CUDA to try to improve the overall performance of the pipeline? Unfortunately I can’t share the code, as it’s technically confidential IP.
Thanks in advance
If it were me, I would issue kernels 2-7 in separate, user-created streams.
I would try to do the necessary device memory allocations before getting into work issuance. If possible, use cudaMemcpyAsync (using the same stream as the subsequent kernel that depends on it).
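Here is a minimal sketch of both ideas combined, assuming six independent kernels standing in for kernels 2-7 (kernel and buffer names are hypothetical):

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the independent kernels 2-7
__global__ void kernelN(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i; // placeholder work
}

int main() {
    const int numKernels = 6; // kernels 2-7
    const int n = 1 << 20;
    cudaStream_t streams[numKernels];
    float* d_buf[numKernels];

    // All allocations done up-front, before any work issuance
    for (int i = 0; i < numKernels; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_buf[i], n * sizeof(float));
    }

    // If a kernel needed input data, the copy would go in the same stream
    // the kernel runs in; the host buffer should be pinned (cudaMallocHost)
    // for the copy to actually overlap with other work:
    // cudaMemcpyAsync(d_buf[i], h_pinned, n * sizeof(float),
    //                 cudaMemcpyHostToDevice, streams[i]);

    // Independent kernels issued into separate streams can run concurrently
    for (int i = 0; i < numKernels; ++i)
        kernelN<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], n);

    cudaDeviceSynchronize(); // all of 2-7 complete before "kernel 8" would launch

    for (int i = 0; i < numKernels; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```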
If by chance you have multiple GPUs available, consider submitting the work of kernels 2-7 across multiple GPUs. This may require some extra synchronization before launching the group of 2-7 and before launching 8.
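Roughly sketched, under the assumption that each of kernels 2-7 can work on its own device-resident buffer (in real code you would also need to copy kernel 1’s output to each device):

```cpp
#include <cuda_runtime.h>

__global__ void work(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i; // placeholder work
}

int main() {
    const int numKernels = 6, n = 1 << 20;
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);
    if (numDevs == 0) return 1;

    float* d_buf[numKernels];
    for (int k = 0; k < numKernels; ++k) {
        cudaSetDevice(k % numDevs);               // route this kernel to a GPU
        cudaMalloc(&d_buf[k], n * sizeof(float)); // allocates on the current device
        work<<<(n + 255) / 256, 256>>>(d_buf[k], n);
    }

    for (int d = 0; d < numDevs; ++d) {           // sync every GPU before kernel 8
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    return 0;
}
```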
It’s possible that none of my suggestions will make any improvement. Basically all of my suggestions have analogs in OpenCL, so the ideas are not really unique to CUDA; your existing OpenCL code might already be doing these things.
This training series may be of interest for those new to CUDA. In particular, the section on concurrency might be relevant.
Thanks for the tip about streams; I will look into setting them up.
As for the allocations, kernels 2-7 are mainly fixed-function. Other than some config values, which are passed as regular structs, there is no actual memcpy done for kernels 2-7. The pattern is simply to create the output buffers that the kernels will put their data into and then submit the kernel. Would it still have the same effect to create the output buffers before kernel 1 is submitted?
Unfortunately I do not have access to multiple GPUs. The program is designed to run locally on user (consumer) machines.
Thanks for the link; that wasn’t one I managed to find when looking for training resources online.
A cudaMalloc call is potentially synchronizing. That means it may inject a “dead spot” into device code activity. When I am teaching CUDA, I generally recommend getting all buffer allocations done up-front, before work issuance begins, if possible. Whether that would be important or significant for your app, I cannot say conclusively. The suggestions I have are generally more meaningful/impactful when there is a work-issuance loop. The more sophisticated your concurrency goals are, the more annoying a cudaMalloc call can be.
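To make the ordering concrete, here is a hedged sketch (buffer names are placeholders; cudaMallocAsync is a real API, available since CUDA 11.2):

```cpp
#include <cuda_runtime.h>

// Preferred ordering: do every allocation before work issuance begins, so
// no cudaMalloc can inject a synchronization point mid-pipeline.
void allocateAll(float** d_a, float** d_b, size_t n) {
    cudaMalloc(d_a, n * sizeof(float));
    cudaMalloc(d_b, n * sizeof(float));
}

// Alternative on CUDA 11.2+: stream-ordered allocation, which is ordered
// within the given stream rather than synchronizing the whole device.
void allocateInStream(float** d_a, size_t n, cudaStream_t s) {
    cudaMallocAsync(reinterpret_cast<void**>(d_a), n * sizeof(float), s);
}
```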
Also, if your kernels are very short in duration, using CUDA graphs may provide some measure of improvement. Again, this is only valuable when there is repetitive work issuance.
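A minimal sketch of that idea using stream capture (the kernel is a placeholder; note the cudaGraphInstantiate signature shown is the pre-CUDA-12 one):

```cpp
#include <cuda_runtime.h>

__global__ void step(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f; // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d_buf;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMalloc(&d_buf, n * sizeof(float));

    // Record the pipeline's work into a graph instead of executing it
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n); // captured, not run
    cudaStreamEndCapture(s, &graph);
    // On CUDA 12+ this becomes cudaGraphInstantiate(&exec, graph, 0)
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replaying the graph amortizes launch overhead across iterations
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);
    return 0;
}
```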
Thanks for the input. Unfortunately it hasn’t really shifted the dial. I did look at CUDA graphs, but I get the impression that they only really help when you run the same pipeline multiple times. Currently we are just evaluating a single invocation of the entire pipeline.
I did try to use Nsight Compute/Systems, but they are far more complicated than I was expecting, and I haven’t found a good example of how to use them to optimise a pipeline.