In our company, we have an in-house CUDA code that was developed 10 years ago.
This code runs on a multi-GPU machine (2x Threadripper + 8x Ampere GPUs), with each simulation using one GPU (in single precision).
When launching 8 of those simulations on one machine, the performance breaks down.
Profiling the application shows that when many computations run on the same machine, the execution times for cudaMalloc and cudaFree calls increase drastically (~10x).
Remarks:
The different computations on one machine do not communicate with one another.
The code was planned as a multi-GPU code; therefore, there is a declaration of DeviceToDevice communication:
This is a known observation, and you can find other similar reports on these forums. The CUDA documentation makes a general provision for this observation here:
Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.
If you are launching 8 independent simulations, each of which is using a single unique GPU, then you might try launching each instance with a CUDA_VISIBLE_DEVICES="X" preamble, where X takes on a value from 0 to 7 for each unique/independent sim. This will limit the “visibility” of the CUDA runtime in each case, and it may help.
and so on. CUDA_VISIBLE_DEVICES has a device index remapping characteristic. So currently, if you are telling each individual invocation of ./sim which GPU to use, such as:
./sim 0
./sim 1
etc., then you would want to modify the invocation to something like:
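For example (the exact form depends on how ./sim takes its argument; because of the remapping, each process now refers to its assigned GPU as device 0):

CUDA_VISIBLE_DEVICES="0" ./sim 0
CUDA_VISIBLE_DEVICES="1" ./sim 0

and so on, up to CUDA_VISIBLE_DEVICES="7".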
I don’t have any further suggestions for that, then (assuming your SLURM assigns a single unique device to each process).
If your application is bound by cudaMalloc/cudaFree performance, then you might want to see if you can reduce the use of those APIs, for example by reusing allocations, or by switching to a pool allocator that you manage yourself (CUDA has a pool allocator available, but I don’t happen to know if it is subject to this multi-thread/multi-process contention issue).
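For reference, here is a minimal sketch of that built-in pool allocator, i.e. the stream-ordered allocator (cudaMallocAsync/cudaFreeAsync, available since CUDA 11.2). The buffer size and loop are placeholders, and as I said, I can’t promise it avoids the contention discussed here:

#include <climits>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Raise the release threshold so the default pool keeps freed memory cached
    // instead of returning it to the OS at every synchronization point.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);
    unsigned long long threshold = ULLONG_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    for (int i = 0; i < 1000; ++i) {
        float *d_buf = nullptr;
        cudaMallocAsync(&d_buf, 1 << 20, stream);  // served from the pool after warm-up
        // ... launch kernels that use d_buf on `stream` ...
        cudaFreeAsync(d_buf, stream);              // returns the block to the pool
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}

With the release threshold raised, subsequent cudaMallocAsync calls are typically satisfied from the cached pool rather than going back through the heavyweight cudaMalloc path.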
Yes, SLURM assigns the variable correctly. (Checked by echo $CUDA_VISIBLE_DEVICES)
If I understand your idea correctly, you want to restrict the GPUs visible to each execution thread to a single GPU, in order to prevent the synchronization from extending to all GPUs.
It’s not related to synchronizations. It’s connected to contention for a shared internal host-based resource managed by the CUDA runtime, where the access control often involves acquisition of a lock. The contention for the lock (and indeed simultaneous access to the shared resource) is causing the increase in the time duration of cudaMalloc/cudaFree. None of this is documented (the above link indicates this aspect of CUDA runtime behavior is explicitly undocumented, and subject to change), but you can find posts on these forums where people have provided evidence that locks are being contended for, in at least some of these cases.
I personally doubt that changing the driver will help.
You’re welcome to try it. It’s not a bad idea, whenever you’re having a problem with CUDA GPUs, to update the GPU driver to the latest available.
But this issue is something that has persisted for quite some time (many years, in my experience), and it’s clear that the issue is known to the CUDA designers; otherwise, why the doc statement I provided?
So I’m not optimistic that changing a driver would help.
You can call it a “driver issue” if you wish, but I would suggest it is probably happening by design.
The only suggestion I have to offer is the one I already made. “Don’t do that.” If you are making extensive use of cudaMalloc/cudaFree such that its performance is the limiting factor for your application, then you may wish to reduce that level of utilization.
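As a rough sketch of what reducing that utilization can look like (the names and sizes below are purely illustrative): allocate the working buffers once, before the time-step loop, reuse them in every step, and free them once at the end.

#include <cuda_runtime.h>

// Hypothetical workspace holding the device buffers a simulation needs.
struct Workspace {
    float *d_state   = nullptr;
    float *d_scratch = nullptr;
};

void workspace_create(Workspace &ws, size_t n) {
    cudaMalloc(&ws.d_state,   n * sizeof(float));  // paid once per simulation
    cudaMalloc(&ws.d_scratch, n * sizeof(float));
}

void workspace_destroy(Workspace &ws) {
    cudaFree(ws.d_state);                          // paid once per simulation
    cudaFree(ws.d_scratch);
}

void run_simulation(size_t n, int steps) {
    Workspace ws;
    workspace_create(ws, n);
    for (int step = 0; step < steps; ++step) {
        // ... launch kernels that read/write ws.d_state and use ws.d_scratch
        //     as temporary storage; no cudaMalloc/cudaFree in the hot loop ...
    }
    workspace_destroy(ws);
}

int main() {
    run_simulation(1 << 20, 100);  // example problem size and step count
    return 0;
}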
Do you think it would help to run several VMs on the host, where each VM gets exclusive access to one GPU via passthrough? The host would not have any NVIDIA driver, only the VM guest systems.
Yes, that might help. I expect it would help, at least as far as the observation you are asking about. Having said that, I should mention that the NVIDIA-supported method for GPU passthrough/virtualization involves a vGPU (or more recently, NVIDIA AI Enterprise) license, and furthermore not all GPUs are supported in this modality (you mention “Ampere” GPUs: an Ampere A100 or A40 is supported, for example, whereas an Ampere RTX 3060 is not supported by NVIDIA for virtualization/passthrough). You might have luck with a “roll your own” passthrough setup, but I don’t have any recipes or instructions for you, and it would be an unsupported configuration by NVIDIA.
As was already mentioned, minimizing invocations of cudaMalloc / cudaFree (e.g. by re-using allocations) is the major lever for addressing the performance issue on the software side. However, you may also wish to examine the hardware side.
If you have multiple different host systems available for experiments, try the one with the highest single-thread performance.
In a CUDA-accelerated application, the GPU(s) take care of the throughput-dominated parallel portions of the code, while the CPU is responsible for the performance of the latency-sensitive serial portions. It is possible for CUDA-accelerated applications to become bottlenecked on the serial portions, and increasingly so, as GPU throughput gains have outstripped CPU latency reductions over the past decades. Bottlenecking on serial host code has been observed in real life, so it is not just a theoretical concern.
Generally speaking, memory allocation (and usually to a lesser degree, de-allocation) is a latency-sensitive activity with severely restricted parallelization opportunities. While allocators are often designed in layers, the lowest level allocator typically involves a “giant global lock”. This means that only one thread at a time can enter the critical section protected by the lock, and if there is contention for the lock, single-thread performance will therefore determine the overall delay incurred.
Neglecting CPU single-thread performance is (IMHO) a common design flaw in system nodes with GPU acceleration. The SPECspeed 2017 Integer portion of the SPEC CPU 2017 benchmarks is a useful indicator of single-thread performance; a simpler indicator is the CPU frequency (I usually look for a CPU frequency >= 3.5 GHz).