We are working on an HPC project where the current concern is achieving consistently high GPU utilization. Only then can we measure throughput reliably and determine the number of leaf-level nodes in our cluster.
We adopted a multi-process design in which CPU processes handle preprocessing and GPU processes handle computationally intensive operations. A job is first handled by a CPU process for preprocessing. Once preprocessing is done, a GPU process takes over the preprocessed data and launches the computation on a GPU. The CPU process then waits for the GPU process to complete and takes back the final result. That is one whole job execution. As you can see, we have multiple GPUs and thus many GPU processes, as well as many CPUs and CPU processes, all running on a single node. In case a single job does not fully utilize a GPU, we use MPS so that one GPU can accept computations from multiple jobs.
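For concreteness, the handoff can be sketched as a two-stage producer/consumer pipeline. This is a thread-based stand-in, not our actual multi-process code; the worker counts and the `* 2` / `+ 1` operations are placeholders for preprocessing and the GPU computation:

```python
import queue
import threading

def run_jobs(jobs, n_cpu_workers=2, n_gpu_workers=1):
    """Stage 1: CPU workers preprocess jobs.
    Stage 2: GPU workers consume preprocessed data and compute."""
    todo = queue.Queue()     # raw jobs awaiting a CPU process
    ready = queue.Queue()    # preprocessed data awaiting a GPU process
    results = []
    lock = threading.Lock()

    def cpu_worker():
        while True:
            item = todo.get()
            if item is None:          # sentinel: no more jobs
                break
            ready.put(item * 2)       # stand-in for preprocessing

    def gpu_worker():
        while True:
            item = ready.get()
            if item is None:          # sentinel: no more preprocessed data
                break
            with lock:
                results.append(item + 1)  # stand-in for the GPU kernel

    cpus = [threading.Thread(target=cpu_worker) for _ in range(n_cpu_workers)]
    gpus = [threading.Thread(target=gpu_worker) for _ in range(n_gpu_workers)]
    for t in cpus + gpus:
        t.start()
    for j in jobs:
        todo.put(j)
    for _ in cpus:                    # one sentinel per CPU worker
        todo.put(None)
    for t in cpus:
        t.join()
    for _ in gpus:                    # CPU stage drained; stop GPU workers
        ready.put(None)
    for t in gpus:
        t.join()
    return results
```

If the `ready` queue is frequently empty, the GPU stage is starved, which is exactly the symptom described below.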
We can achieve near 100% GPU utilization with a certain number of GPUs and GPU processes, accompanied by the right number of CPU processes. Unfortunately, we no longer see full GPU utilization after adding a new GPU to the node and increasing the number of GPU processes accordingly. From htop, the CPUs are not fully utilized either. This confuses me: since the CPUs have spare capacity, there should be room for the CPU processes to do more preprocessing and keep the GPUs fed in time. Yet from my observation, the GPUs frequently stall because the CPU processes cannot feed data to the GPU processes fast enough. One of my guesses is frequent context switching inherent in the system design, especially as the system scales up. As I am no expert on this problem, I am hoping HPC experts can confirm this hypothesis and provide more hints.
Achieving both 100% CPU utilization and 100% GPU utilization is not generally possible, as that would imply the throughput of the GPU processing exactly matches the throughput of the CPU processing. If the goal is to keep the GPU close to 100% busy, some idling of the CPU should be expected. If the use case allows work to be shifted easily between CPU and GPU, you might want to look into some form of automated load balancing.
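One way to reason about this balance is a back-of-the-envelope throughput model. The per-job times below are illustrative numbers, not measurements from your system:

```python
import math

def gpu_utilization(cpu_ms_per_job, gpu_ms_per_job, n_cpu_workers):
    """Upper bound on one GPU's utilization when fed by n_cpu_workers
    CPU workers: compare the rate jobs arrive vs. the rate the GPU
    can absorb them."""
    feed_rate = n_cpu_workers / cpu_ms_per_job   # jobs/ms delivered
    drain_rate = 1.0 / gpu_ms_per_job            # jobs/ms consumed
    return min(1.0, feed_rate / drain_rate)

def cpu_workers_needed(cpu_ms_per_job, gpu_ms_per_job):
    """Smallest CPU worker count that keeps the GPU saturated."""
    return math.ceil(cpu_ms_per_job / gpu_ms_per_job)
```

For example, if preprocessing takes 3 ms per job and the GPU kernel takes 2 ms, a single CPU worker caps GPU utilization at about 67%, and two workers per GPU are needed to saturate it. Note that in this regime the CPU workers then run below 100%, which matches the observation that full utilization on both sides cannot coexist.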
Since the performance of GPUs has grown faster than the performance of CPUs, it is not uncommon for applications to become (partially) limited by the host's performance at this point in time. Generally speaking, higher CPU base frequency (personally I recommend >= 3.5 GHz) and, to a lesser degree, larger system memory (as a rule of thumb, system memory should be 2x-4x the sum of all GPU memory) and higher system-memory throughput (ideal would be octa-channel DDR4) tend to correlate positively with improved feeding of GPUs. Obviously one would want to use the fastest available physical interconnect between host and GPU; in most cases that would be PCIe gen4 x16 at present.
The structure of the application is very important. It is possible to lose utilization to synchronization overhead. Use of API calls that explicitly or implicitly synchronize should be minimized. This includes memory allocation and de-allocation.
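A common way to avoid per-job allocation is to grab all buffers up front and recycle them. Here is a minimal pool sketch in Python pseudocode form; in CUDA the same idea applies to device buffers, since `cudaMalloc`/`cudaFree` can synchronize the device:

```python
from collections import deque

class BufferPool:
    """Preallocate fixed-size buffers once; acquire/release per job
    instead of allocating and freeing in the hot path."""
    def __init__(self, n_buffers, size_bytes):
        self._free = deque(bytearray(size_bytes) for _ in range(n_buffers))

    def acquire(self):
        if not self._free:
            # Size the pool for peak concurrency rather than growing it
            # on demand, which would reintroduce allocation in the hot path.
            raise RuntimeError("pool exhausted")
        return self._free.popleft()

    def release(self, buf):
        self._free.append(buf)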
Generally speaking, the app should take the form of a software pipeline, in which all of the following occur concurrently: (1) DMA transfer of data chunk N+1 to the GPU, (2) GPU operating on data chunk N, (3) DMA transfer of the results for data chunk N-1 to the host system. This requires, for example, the use of asynchronous copies and CUDA streams. Depending on what needs to happen on the host, a double-buffering scheme may be needed for best performance.
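The resulting schedule can be written down explicitly. The sketch below only models which operations overlap at each pipeline step; in real CUDA code each column would map to a `cudaMemcpyAsync` or kernel launch on its own stream, with double-buffered staging buffers:

```python
def pipeline_schedule(n_chunks):
    """At step k, three operations run concurrently:
    H2D copy of chunk k, compute on chunk k-1, D2H copy of chunk k-2.
    The pipeline needs two extra steps to drain at the end."""
    steps = []
    for k in range(n_chunks + 2):
        step = []
        if k < n_chunks:
            step.append(("h2d", k))          # host-to-device DMA
        if 0 <= k - 1 < n_chunks:
            step.append(("compute", k - 1))  # kernel on previous chunk
        if 0 <= k - 2 < n_chunks:
            step.append(("d2h", k - 2))      # results back to host
        steps.append(step)
    return steps
```

Once the pipeline is full (from step 2 onward), every step keeps the copy engines and the compute engine busy simultaneously, which is what hides the transfer latency.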
Depending on how this app uses the cluster, it might make sense to look into scheduling software like LSF for best overall utilization. I have not used LSF for many years at this point, but in the past I was a happy user of it for a number of years, and even administered it on a cluster of about 100 machines at one time.