In our company, we have an in-house CUDA code that was developed 10 years ago.
This code runs on a multi-GPU machine (2x Threadripper + 8x Ampere GPUs), with each simulation using one GPU (in single precision).
When launching 8 of those simulations on one machine, the performance breaks down.
Profiling the application shows that when many computations run on the same machine, the execution times for cudaMalloc and cudaFree calls increase drastically (~10x).
Remarks:
The different computations on one machine do not communicate with one another.
The code was planned as a multi-GPU code; therefore, there is a declaration of DeviceToDevice communication:
This is a known observation, and you can find other similar reports on these forums. The CUDA documentation makes a general provision for this observation here:
Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.
If you are launching 8 independent simulations, each of which is using a single unique GPU, then you might try launching each instance with a CUDA_VISIBLE_DEVICES="X" preamble, where X takes on a value from 0 to 7 for each unique/independent sim. This will limit the “visibility” of the CUDA runtime in each case, and it may help.
and so on. CUDA_VISIBLE_DEVICES has a device index remapping characteristic. So currently, if you are telling each individual invocation of ./sim which GPU to use, such as:
./sim 0
./sim 1
etc., then you would want to modify the invocation to something like:
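For example (the exact form depends on how ./sim takes its argument; because of the remapping, each process now refers to its assigned GPU as device 0):

CUDA_VISIBLE_DEVICES="0" ./sim 0
CUDA_VISIBLE_DEVICES="1" ./sim 0

and so on, up to CUDA_VISIBLE_DEVICES="7".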
I don’t have any further suggestions for that, then (assuming your SLURM assigns a single unique device to each process).
If your application is bound by cudaMalloc/cudaFree performance, then you might want to see if you can reduce the use of those APIs, for example by reusing allocations, or by switching to a pool allocator that you manage yourself (CUDA has a pool allocator available, but I don’t happen to know if it is subject to this multi-thread/multi-process contention issue).
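For reference, here is a minimal sketch of that built-in pool allocator, i.e. the stream-ordered allocator (cudaMallocAsync/cudaFreeAsync, available since CUDA 11.2). The buffer size and loop are placeholders, and as I said, I can’t promise it avoids the contention discussed here:

#include <climits>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Raise the release threshold so the default pool keeps freed memory cached
    // instead of returning it to the OS at every synchronization point.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);
    unsigned long long threshold = ULLONG_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    for (int i = 0; i < 1000; ++i) {
        float *d_buf = nullptr;
        cudaMallocAsync(&d_buf, 1 << 20, stream);  // served from the pool after warm-up
        // ... launch kernels that use d_buf on `stream` ...
        cudaFreeAsync(d_buf, stream);              // returns the block to the pool
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}

With the release threshold raised, subsequent cudaMallocAsync calls are typically satisfied from the cached pool rather than going back through the heavyweight cudaMalloc path.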
Yes, SLURM assigns the variable correctly. (Checked by echo $CUDA_VISIBLE_DEVICES)
If I understand your idea correctly, you want to restrict the GPUs visible to each execution thread to a single GPU, in order to prevent the synchronization from extending to all GPUs.
It’s not related to synchronizations. It’s connected to contention for a shared internal host-based resource managed by the CUDA runtime, where the access control often involves acquisition of a lock. The contention for the lock (and indeed simultaneous access to the shared resource) is causing the increase in the time duration of cudaMalloc/cudaFree. None of this is documented (the above link indicates this aspect of CUDA runtime behavior is explicitly undocumented, and subject to change), but you can find posts on these forums where people have provided evidence that locks are being contended for, in at least some of these cases.
I personally doubt that changing the driver will help.
You’re welcome to try it. It’s not a bad idea, whenever you’re having a problem with CUDA GPUs, to update the GPU driver to the latest available.
But this issue is something that has persisted for quite some time (many years, in my experience), and it’s clear that the issue is known to the CUDA designers; otherwise, why the doc statement I provided?
So I’m not optimistic that changing a driver would help.
You can call it a “driver issue” if you wish, but I would suggest it is probably happening by design.
The only suggestion I have to offer is the one I already made. “Don’t do that.” If you are making extensive use of cudaMalloc/cudaFree such that its performance is the limiting factor for your application, then you may wish to reduce that level of utilization.
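As a rough sketch of what reducing that utilization can look like (the names and sizes below are purely illustrative): allocate the working buffers once, before the time-step loop, reuse them in every step, and free them once at the end.

#include <cuda_runtime.h>

// Hypothetical workspace holding the device buffers a simulation needs.
struct Workspace {
    float *d_state   = nullptr;
    float *d_scratch = nullptr;
};

void workspace_create(Workspace &ws, size_t n) {
    cudaMalloc(&ws.d_state,   n * sizeof(float));  // paid once per simulation
    cudaMalloc(&ws.d_scratch, n * sizeof(float));
}

void workspace_destroy(Workspace &ws) {
    cudaFree(ws.d_state);                          // paid once per simulation
    cudaFree(ws.d_scratch);
}

void run_simulation(size_t n, int steps) {
    Workspace ws;
    workspace_create(ws, n);
    for (int step = 0; step < steps; ++step) {
        // ... launch kernels that read/write ws.d_state and use ws.d_scratch
        //     as temporary storage; no cudaMalloc/cudaFree in the hot loop ...
    }
    workspace_destroy(ws);
}

int main() {
    run_simulation(1 << 20, 100);  // example problem size and step count
    return 0;
}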
Do you think it would help to run several VMs on the host, where each VM gets exclusive access to one GPU via passthrough? The host would not have any NVIDIA driver, only the VM guest systems.
Yes, that might help. I expect it would help, at least as far as the observation you are asking about. Having said that, I should mention that the NVIDIA-supported method for GPU passthrough/virtualization involves a vGPU (or more recently, NVIDIA AI Enterprise) license, and furthermore not all GPUs are supported in this modality (you mention “Ampere” GPUs: an Ampere A100 or A40 is supported, for example, whereas an Ampere RTX 3060 is not supported by NVIDIA for virtualization/passthrough). You might have luck with a “roll your own” passthrough setup, but I don’t have any recipes or instructions for you, and it would be an unsupported configuration by NVIDIA.
As was already mentioned, minimizing invocations of cudaMalloc / cudaFree (e.g. by re-using allocations) is the major lever for addressing the performance issue on the software side. However, you may also wish to examine the hardware side.
If you have multiple different host systems available for experiments, try the one with the highest single-thread performance.
In a CUDA-accelerated application, the GPU(s) take care of the throughput-dominated parallel portions of the code, while the CPU is responsible for the performance of the latency-sensitive serial portions. It is possible for CUDA-accelerated applications to become bottlenecked on the serial portions, and increasingly so, as GPU throughput gains have outstripped CPU latency reductions over the past decades. Bottlenecking on serial host code has been observed in real life, so it is not just a theoretical concern.
Generally speaking, memory allocation (and usually to a lesser degree, de-allocation) is a latency-sensitive activity with severely restricted parallelization opportunities. While allocators are often designed in layers, the lowest level allocator typically involves a “giant global lock”. This means that only one thread at a time can enter the critical section protected by the lock, and if there is contention for the lock, single-thread performance will therefore determine the overall delay incurred.
Neglecting CPU single-thread performance is (IMHO) a common design flaw in system nodes with GPU acceleration. The SPECspeed 2017 Integer portion of the SPEC CPU 2017 benchmarks is a useful indicator of single-thread performance; a simpler indicator is the CPU frequency (I usually look for a CPU frequency >= 3.5 GHz).