4-way concurrency

I happened to review a NVIDIA presentation on optimizing GPU performance where there is a mention of 4-way concurrency. https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

The presentation describes CUDA kernels being parallelly run on GPU and CPU to improve performance. Are there any examples for 4-way concurrency? Has anyone tried it successfully? Thanks much.

4-way concurrency refers to:

  • An H->D transfer
  • A GPU kernel
  • A D->H transfer
  • Some CPU code

all running at the same time. It does not mean "CUDA kernels being parallelly run on GPU and CPU".

CUDA kernels don’t run on the CPU, but CPU code does, and you can run CPU code on the CPU at the same time that a CUDA kernel is running on the GPU.
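As a rough sketch of what this looks like in code (the kernel, buffer names, and chunking scheme here are my own assumptions for illustration, not taken from the webinar): with work issued into separate streams and pinned host buffers, a GPU with two copy engines can overlap all four activities.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work: doubles each element.
__global__ void gpuWork(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pinned host memory is required for cudaMemcpyAsync to be truly asynchronous.
    float *h_a, *h_b, *d_a, *d_b;
    cudaHostAlloc((void **)&h_a, bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_b, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Imagine a pipeline over chunks of data: chunk B (d_b) was uploaded in a
    // previous iteration; chunk A (h_a/d_a) is next. In the steady state, on a
    // GPU with two copy engines, the four activities below run at the same time:
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);  // 1. H->D (chunk A)
    gpuWork<<<(N + 255) / 256, 256, 0, s2>>>(d_b, N);              // 2. kernel (chunk B)
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s2);  // 3. D->H (chunk B)

    double cpuSum = 0.0;                                           // 4. CPU code: the host
    for (int i = 0; i < N; ++i) cpuSum += h_a[i];                  //    computes meanwhile

    cudaDeviceSynchronize();  // wait for all GPU work before using the results

    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    return 0;
}
```

Note that the kernel and the D->H copy are issued into the same stream, so stream ordering guarantees the copy sees the kernel's results, while the H->D copy in the other stream is free to overlap with both.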

In the presentation you linked, on slide 6 for instance, the meaning is that the matrix multiplication problem can be broken up into "tiles of work". A tile can be processed either on the CPU or on the GPU; it is basically a mini matrix multiply. So the 4-way concurrency there means processing tiles on the GPU while processing (different) tiles on the CPU. Ultimately the workload split might be 80% of tiles processed on the GPU and 20% on the CPU, or an even more lopsided split like 90/10 or 95/5.
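A hypothetical host-side sketch of such a split (the function names, the stub bodies, and the 80/20 ratio are placeholders, not from the slides): the GPU's share of tiles is queued asynchronously, and the CPU then processes its own tiles while the GPU works.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel for one tile's mini matrix multiply (an assumption;
// the real tile kernel in a GEMM/HPL implementation is far more involved).
__global__ void tileKernel(int tile) { /* ... process tile on the GPU ... */ }

void processTileCPU(int tile) { /* ... process tile on the CPU ... */ }

int main()
{
    const int numTiles = 100;
    const float gpuFraction = 0.8f;                // e.g. an 80/20 split
    const int gpuTiles = (int)(gpuFraction * numTiles);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queue the GPU's share asynchronously; these launches return immediately.
    for (int t = 0; t < gpuTiles; ++t)
        tileKernel<<<1, 256, 0, stream>>>(t);

    // Meanwhile the CPU chews through its own (different) tiles.
    for (int t = gpuTiles; t < numTiles; ++t)
        processTileCPU(t);

    cudaStreamSynchronize(stream);                 // wait for the GPU's share
    printf("processed %d tiles on GPU, %d on CPU\n", gpuTiles, numTiles - gpuTiles);

    cudaStreamDestroy(stream);
    return 0;
}
```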

There are various linear algebra algorithms that can be “tile decomposed” this way. As an example, the NVIDIA implementation of HPL in the NGC HPC benchmark container has an environment variable that can be used to direct the splitting of work this way.


I note that the linked slide deck mentions the C2070, so it can be dated to the early years of CUDA. Generally I would caution against splitting identical work between GPU and CPU to distribute a workload.

Historically, this is what some applications chose to do in the Fermi time frame, but by the time Pascal rolled around, they found themselves unnecessarily limited by CPU activity. Conversely, applications that had focused on a "GPU-only" approach were at an advantage, without undue performance restrictions.

The primary reason for this development was that GPU performance increased much faster than CPU performance, and a secondary reason was that host/device communication pipe throughput grew fairly slowly.

Generally speaking, the advantageous approach for a GPU-accelerated application is therefore to move all the computational heavy lifting to the GPU, including parts that expose only limited parallelism, and to keep data resident on the GPU as long as possible. In this scenario, the CPU serves primarily as a control processor and storage administrator (with GPUDirect, the latter aspect loses some importance).

With the CPU serving the serial portion of the code, it is important to focus CPU selection on high single-thread throughput, not on high core counts (typically 4 CPU cores per GPU are sufficient). It is also important not to under-provision system memory, which is the immediate data source / data sink for the GPU: system memory should be sized at 2x to 4x of total GPU memory, depending on use case and system size. NVMe mass storage is advantageous in various application areas as a backing store behind the system-memory buffering.