4-way concurrency

I happened to review an NVIDIA presentation on optimizing GPU performance that mentions 4-way concurrency: https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

The presentation describes CUDA kernels being run in parallel on the GPU and CPU to improve performance. Are there any examples of 4-way concurrency? Has anyone tried it successfully? Thanks much.

4-way concurrency refers to:

  • An H->D transfer
  • A GPU kernel
  • A D->H transfer
  • Some CPU code

all running at the same time. It does not mean “CUDA kernels being run in parallel on the GPU and CPU”.

CUDA kernels don’t run on the CPU, but CPU code does, and it can run at the same time that a CUDA kernel is running on the GPU.
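Here is a minimal sketch of what that looks like in practice, assuming a trivial elementwise kernel, pinned host buffers, and arbitrary chunk counts/sizes (none of which come from the slides). Each chunk gets its own stream, so while one chunk’s kernel runs, another chunk’s H->D copy, a third chunk’s D->H copy, and ordinary CPU work on the host thread can all be in flight:

```cpp
// Sketch only: a trivial kernel and arbitrary chunk count/size stand in for real work.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main()
{
    const int nChunks = 4;
    const int chunkN  = 1 << 20;
    float *h[nChunks], *d[nChunks];
    cudaStream_t s[nChunks];

    for (int i = 0; i < nChunks; i++) {
        cudaMallocHost((void**)&h[i], chunkN * sizeof(float)); // pinned memory: required for async overlap
        cudaMalloc((void**)&d[i], chunkN * sizeof(float));
        cudaStreamCreate(&s[i]);
        for (int j = 0; j < chunkN; j++) h[i][j] = 1.0f;
    }

    // Each chunk: H->D copy, kernel, D->H copy, all issued asynchronously into its
    // own stream. Across streams, the copy engines and SMs overlap these stages.
    for (int i = 0; i < nChunks; i++) {
        cudaMemcpyAsync(d[i], h[i], chunkN * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunkN + 255) / 256, 256, 0, s[i]>>>(d[i], chunkN, 2.0f);
        cudaMemcpyAsync(h[i], d[i], chunkN * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }

    // The fourth "way": the host thread is free to do CPU work while the GPU pipeline runs.
    double cpuSum = 0.0;
    for (int j = 0; j < (1 << 24); j++) cpuSum += 1e-7 * j;

    cudaDeviceSynchronize();
    printf("h[0][0] = %f (expected 2.0), cpuSum = %f\n", h[0][0], cpuSum);

    for (int i = 0; i < nChunks; i++) {
        cudaStreamDestroy(s[i]);
        cudaFree(d[i]);
        cudaFreeHost(h[i]);
    }
    return 0;
}
```

Note that pinned (page-locked) host memory and a GPU with at least two copy engines are needed for copies in both directions to overlap; a profiler timeline (e.g. Nsight Systems) is the easiest way to confirm that all four activities actually run at the same time.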

In the presentation you linked, on slide 6 for example, the idea is that the matrix multiplication problem can be broken up into “tiles of work”. The processing of a tile, which is basically a mini matrix multiply, can take place either on the CPU or on the GPU. So the 4-way concurrency there means processing tiles on the GPU while processing (different) tiles on the CPU, with tile data being transferred to and from the device at the same time. Ultimately the workload split might be 80% of the tiles processed on the GPU and 20% on the CPU, or an even more GPU-heavy split such as 90/10 or 95/5.
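As a rough illustration of that split (and not the actual code behind the slides), the sketch below replaces the per-tile matrix multiply with a trivial per-tile operation and uses an arbitrary 80/20 split. The GPU’s share of tiles is issued asynchronously into a stream, and the host thread processes its own tiles while the GPU pipeline is in flight:

```cpp
// Sketch only: the per-tile operation is a stand-in for a tile GEMM, and the
// 80/20 split, tile count, and tile size are arbitrary illustrative choices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void processTileGpu(float *tile, int tileSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < tileSize) tile[i] = tile[i] * 2.0f + 1.0f;   // stand-in for a mini matrix multiply
}

static void processTileCpu(float *tile, int tileSize)
{
    for (int i = 0; i < tileSize; i++) tile[i] = tile[i] * 2.0f + 1.0f;
}

int main()
{
    const int numTiles = 100;
    const int tileSize = 1 << 16;
    const int gpuTiles = (numTiles * 80) / 100;          // 80% of tiles go to the GPU

    float *hData, *dData;
    cudaMallocHost((void**)&hData, (size_t)numTiles * tileSize * sizeof(float)); // pinned
    cudaMalloc((void**)&dData, (size_t)gpuTiles * tileSize * sizeof(float));
    for (size_t i = 0; i < (size_t)numTiles * tileSize; i++) hData[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // GPU share: copy its tiles in, process them, copy them back, all asynchronously
    size_t gpuBytes = (size_t)gpuTiles * tileSize * sizeof(float);
    cudaMemcpyAsync(dData, hData, gpuBytes, cudaMemcpyHostToDevice, stream);
    for (int t = 0; t < gpuTiles; t++)
        processTileGpu<<<(tileSize + 255) / 256, 256, 0, stream>>>(dData + (size_t)t * tileSize, tileSize);
    cudaMemcpyAsync(hData, dData, gpuBytes, cudaMemcpyDeviceToHost, stream);

    // CPU share: the host thread processes the remaining tiles while the GPU works
    for (int t = gpuTiles; t < numTiles; t++)
        processTileCpu(hData + (size_t)t * tileSize, tileSize);

    cudaStreamSynchronize(stream);                        // join the two halves
    printf("GPU tile result %f, CPU tile result %f (both expected 3.0)\n",
           hData[0], hData[(size_t)gpuTiles * tileSize]);

    cudaFree(dData);
    cudaFreeHost(hData);
    return 0;
}
```

In a real code the split ratio has to be tuned to the relative throughput of the particular CPU and GPU, which is part of what makes this approach brittle across hardware generations.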

There are various linear algebra algorithms that can be “tile decomposed” this way. As an example, the NVIDIA implementation of HPL in the NGC HPC benchmark container has an environment variable that can be used to control how the work is split between GPU and CPU in this fashion.


I note that the linked slide deck mentions the C2070, so it can be dated to the early years of CUDA. Generally I would caution against splitting identical work between GPU and CPU to distribute a workload.

Historically, this is what some applications chose to do in the Fermi time frame, but by the time Pascal rolled around, those applications found themselves unnecessarily limited by CPU activity. Conversely, applications that had focused on a “GPU-only” approach were at an advantage, without undue performance restrictions.

The primary reason for this development was that GPU performance increased much faster than CPU performance; a secondary reason was that the throughput of the host/device interconnect grew fairly slowly.

Generally speaking, the advantageous approach for a GPU-accelerated application is therefore to move all the computational heavy lifting to the GPU, including parts that expose only limited parallelism, and to keep data resident on the GPU for as long as possible. In this scenario, the CPU serves primarily as a control processor and storage administrator (with GPUDirect, the latter aspect loses some importance). Since the CPU serves the serial portion of the code, it is important to focus CPU selection on high single-thread throughput rather than on high core counts (typically 4 CPU cores per GPU are sufficient). It is also important not to under-provision system memory, which is the immediate data source / data sink for the GPU; system memory should be sized at 2x to 4x of total GPU memory, depending on use case and system size. NVMe mass storage is advantageous in various application areas as a backing store behind the system memory buffering.
