Hi, I am evaluating CUDA as an alternative for an existing OpenCL project we develop at work. So far I have done a straightforward conversion of one of our OpenCL modules to CUDA, and as someone who is entirely new to CUDA I am looking for tips on how to improve the performance of the CUDA pipeline.
The pipeline looks like this:

[pipeline diagram: host-to-device memcpy → kernel 1 → kernels 2-7 (independent) → kernel 8]
Some points about the pipeline:
- The kernels in the center of the pipeline are independent of each other; however, since the host code is single-threaded, they are submitted sequentially
- The bulk of the memcpy work is done before kernel 1 is submitted
- Each kernel submission follows the same pattern: allocate device memory for that kernel, submit the kernel with its device buffers and parameters, then move on to the next kernel (a sketch follows this list)
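For concreteness, here is a minimal sketch of that per-kernel pattern; the kernel, names, and sizes are hypothetical stand-ins, since the real code is confidential:

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the pipeline kernels
__global__ void kernel1(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i]; // placeholder work
}

void submitStep(const float* d_in, int n) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));             // allocate for this kernel
    kernel1<<<(n + 255) / 256, 256>>>(d_in, d_out, n); // submit with buffers/params
    // ...then repeat the same allocate-and-launch steps for the next kernel
}
```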
What recommendations would you make for someone who is new to CUDA to try to improve the overall performance of the pipeline? Unfortunately I can’t share the code, as it’s technically confidential IP.
Thanks in advance
If it were me, I would issue kernels 2-7 in separate, user-created streams.
I would try to do the necessary device memory allocations before getting into work issuance. If possible, use cudaMemcpyAsync (using the same stream as the subsequent kernel that depends on it).
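Here is a minimal sketch of both ideas combined, assuming six independent kernels standing in for kernels 2-7 (kernel and buffer names are hypothetical):

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the independent kernels 2-7
__global__ void kernelN(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i; // placeholder work
}

int main() {
    const int numKernels = 6; // kernels 2-7
    const int n = 1 << 20;
    cudaStream_t streams[numKernels];
    float* d_buf[numKernels];

    // All allocations done up-front, before any work issuance
    for (int i = 0; i < numKernels; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_buf[i], n * sizeof(float));
    }

    // If a kernel needed input data, the copy would go in the same stream
    // the kernel runs in; the host buffer should be pinned (cudaMallocHost)
    // for the copy to actually overlap with other work:
    // cudaMemcpyAsync(d_buf[i], h_pinned, n * sizeof(float),
    //                 cudaMemcpyHostToDevice, streams[i]);

    // Independent kernels issued into separate streams can run concurrently
    for (int i = 0; i < numKernels; ++i)
        kernelN<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], n);

    cudaDeviceSynchronize(); // all of 2-7 complete before "kernel 8" would launch

    for (int i = 0; i < numKernels; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```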
If by chance you have multiple GPUs available, consider submitting the work of kernels 2-7 across multiple GPUs. This may require some extra synchronization before launching the group of 2-7 and before launching 8.
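Roughly sketched, under the assumption that each of kernels 2-7 can work on its own device-resident buffer (in real code you would also need to copy kernel 1’s output to each device):

```cpp
#include <cuda_runtime.h>

__global__ void work(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i; // placeholder work
}

int main() {
    const int numKernels = 6, n = 1 << 20;
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);
    if (numDevs == 0) return 1;

    float* d_buf[numKernels];
    for (int k = 0; k < numKernels; ++k) {
        cudaSetDevice(k % numDevs);               // route this kernel to a GPU
        cudaMalloc(&d_buf[k], n * sizeof(float)); // allocates on the current device
        work<<<(n + 255) / 256, 256>>>(d_buf[k], n);
    }

    for (int d = 0; d < numDevs; ++d) {           // sync every GPU before kernel 8
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    return 0;
}
```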
It’s possible that none of my suggestions will make any improvement. Basically all of my suggestions have analogs in OpenCL, so the ideas are not really unique to CUDA; your existing OpenCL code might already be doing these things.
This training series may be of interest for those new to CUDA. In particular, the section on concurrency might be relevant.
Thanks for the tip about streams; I will look into setting them up.
As for the allocations, kernels 2-7 are mainly fixed-function. Other than some config values, which are passed as regular structs, there is no actual memcpy done for kernels 2-7. The pattern is simply to create the output buffers that the kernels will put their data into and then submit the kernel. Would it still have the same effect to create the output buffers before kernel 1 is submitted?
Unfortunately I do not have access to multiple GPUs. The program is designed to run locally on user (consumer) machines.
Thanks for the link; that wasn’t one I managed to find when looking for training resources online.
A cudaMalloc call is potentially synchronizing. That means it may inject a “dead spot” into device code activity. When I am teaching CUDA, I generally recommend getting all buffer allocations done up-front, before work issuance begins, if possible. Whether that would be important or significant for your app, I cannot say conclusively. The suggestions I have are generally more meaningful/impactful when there is a work-issuance loop. The more sophisticated your concurrency goals are, the more annoying a cudaMalloc call can be.
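To make the ordering concrete, here is a hedged sketch (buffer names are placeholders; cudaMallocAsync is a real API, available since CUDA 11.2):

```cpp
#include <cuda_runtime.h>

// Preferred ordering: do every allocation before work issuance begins, so
// no cudaMalloc can inject a synchronization point mid-pipeline.
void allocateAll(float** d_a, float** d_b, size_t n) {
    cudaMalloc(d_a, n * sizeof(float));
    cudaMalloc(d_b, n * sizeof(float));
}

// Alternative on CUDA 11.2+: stream-ordered allocation, which is ordered
// within the given stream rather than synchronizing the whole device.
void allocateInStream(float** d_a, size_t n, cudaStream_t s) {
    cudaMallocAsync(reinterpret_cast<void**>(d_a), n * sizeof(float), s);
}
```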
Also, if your kernels are very short in duration, using CUDA graphs may provide some measure of improvement. Again, this is only valuable when there is repetitive work issuance.
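A minimal sketch of that idea using stream capture (the kernel is a placeholder; note the cudaGraphInstantiate signature shown is the pre-CUDA-12 one):

```cpp
#include <cuda_runtime.h>

__global__ void step(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f; // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d_buf;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMalloc(&d_buf, n * sizeof(float));

    // Record the pipeline's work into a graph instead of executing it
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n); // captured, not run
    cudaStreamEndCapture(s, &graph);
    // On CUDA 12+ this becomes cudaGraphInstantiate(&exec, graph, 0)
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replaying the graph amortizes launch overhead across iterations
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);
    return 0;
}
```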
Thanks for the input. Unfortunately it hasn’t really shifted the dial. I did look at CUDA graphs, but I get the impression that they only really help when you run the same pipeline multiple times. Currently we are just evaluating a single invocation of the entire pipeline.
I did try to use Nsight Compute/Systems, but they are far more complicated than I was expecting, and I haven’t found a good example of how to use them to optimise a pipeline.