I would like to know the cause of, and a solution for, the following cudaMemcpy latency behavior.
The latency increases when 2 processes on the host (CPU) use 1 device (GPU), compared to when 1 process uses it.
Overview:
Situation (A): latency for cudaMemcpy is high (>100 us) when 2 processes use 1 device.
Situation (B): latency for cudaMemcpy is small (~10 us) when 1 process uses 1 device.
What could cause this behavior and how to mitigate it?
I assume this is not related to the data copy size, because the data copied here is small (32 bytes).
When multiple threads or processes use the CUDA runtime API, there is potentially an internal competition for locks which may increase the latency of a runtime API call. There are numerous reports of this on these forums, and the possibility is documented.
When 2 (or more) processes are using the same GPU (without MPS), there may be a necessity for a context-switch in between the execution of work items associated with the first process and work items associated with the second process.
Mitigation for the first item is not a trivial matter. It might require refactoring your code to issue work from a single thread/process.
A possible mitigation avenue for the second possibility could be the use of MPS. To be clear, I’m not suggesting that if you use MPS suddenly the problem will disappear; it may have no effect at all. It’s just a possibility. If it were my code, I might try using MPS to learn and see what the impact is, if any.
cudaMemcpy is not an asynchronous operation, normally. Since the GPU is an asynchronous, latency-hiding machine, another general design principle would be to issue as much work as possible via async operations.
Also, since cudaMemcpy is blocking/synchronizing in typical usage, another possible cause of increased latency in a multiple-client case is simply that the cudaMemcpy is by definition waiting for the completion of work, and there is more work queued up to the GPU in the multiple client case.
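As a rough illustration (not your code; the buffer names and the 32-byte size are just placeholders), the blocking copy could be replaced with an asynchronous copy from pinned memory on a dedicated stream, so the host thread is not forced to wait on unrelated queued GPU work:

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32;
    float *h_x = nullptr, *d_x = nullptr;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaHostAlloc((void**)&h_x, bytes, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void**)&d_x, bytes);

    // ... fill h_x ...

    // Asynchronous copy: control returns to the host immediately; the work is
    // ordered only with respect to other operations in 'stream'.
    cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);

    // Synchronize only when the result is actually needed.
    cudaStreamSynchronize(stream);

    cudaFree(d_x);
    cudaFreeHost(h_x);
    cudaStreamDestroy(stream);
    return 0;
}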
Simple question, but the duration of CUDA API calls depends on the CPU/RAM spec and is independent of the GPU/PCIe spec, correct? (except for some portion of the third cudaMemcpy, as it first waits for the kernel execution to complete)
Somehow, neither pinned memory nor zero copy (cudaHostAllocMapped) improved the speed in my environment.
So, if I keep cudaMemcpy as it is now, I am thinking about upgrading the CPU and RAM.
That should reduce overhead to some extent, for example the copy from pageable (virtual) memory to the staging buffer inside cudaMemcpy, or the processing needed to launch a kernel.
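For reference, this is roughly how I set up the zero-copy path (a minimal sketch with hypothetical names; the scale kernel is just a stand-in for my real work):

#include <cuda_runtime.h>

__global__ void scale(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

int main() {
    const int n = 8;                           // 8 floats = 32 bytes
    float *h_x = nullptr, *d_x = nullptr, *d_y = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);     // enable mapped pinned memory

    // Mapped, pinned host allocation: the kernel reads it directly over PCIe,
    // so there is no explicit H2D cudaMemcpy, but every access crosses the bus.
    cudaHostAlloc((void**)&h_x, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_x, h_x, 0);   // device-visible alias of h_x
    cudaMalloc((void**)&d_y, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_x[i] = float(i);

    scale<<<1, n>>>(d_x, d_y, 2.0f, n);        // no explicit host-to-device copy
    cudaDeviceSynchronize();

    cudaFree(d_y);
    cudaFreeHost(h_x);
    return 0;
}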
You might want to consider compressing the data being copied. Sometimes a simple switch to a more efficient data type will help, subject to accuracy constraints. More elaborate compression schemes (some domain specific) to be used with GPUs have been published.
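As a hedged illustration of the data-type idea (assuming FP16 accuracy is acceptable for your data; the widen kernel and sizes below are made up), the host can convert to half precision, copy half as many bytes, and widen back to float on the device:

#include <cuda_runtime.h>
#include <cuda_fp16.h>

__global__ void widen(const __half *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __half2float(in[i]);
}

int main() {
    const int n = 1 << 20;
    float  *h_f = new float[n];
    __half *h_h = new __half[n];
    for (int i = 0; i < n; ++i) { h_f[i] = float(i); h_h[i] = __float2half(h_f[i]); }

    __half *d_h; float *d_f;
    cudaMalloc((void**)&d_h, n * sizeof(__half));
    cudaMalloc((void**)&d_f, n * sizeof(float));

    cudaMemcpy(d_h, h_h, n * sizeof(__half), cudaMemcpyHostToDevice);  // half the bytes
    widen<<<(n + 255) / 256, 256>>>(d_h, d_f, n);                      // restore float on device
    cudaDeviceSynchronize();

    cudaFree(d_f); cudaFree(d_h);
    delete[] h_h; delete[] h_f;
    return 0;
}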
That’s an oversimplification, I would say. For example, the PCIe generation supported by a GPU can impact host->device copy performance. At least for a transitional time that is ending now with Blackwell shipping, many GPUs with PCIe 4 were used on platforms capable of PCIe 5. A more accurate characterization is IMHO: The performance of many CUDA API calls is dominated primarily by CPU single-thread performance, and secondarily by system memory performance. For x86-64 platforms, single-thread performance strongly correlates with processor clock frequency.
Your profiler output shows a call to cudaMemcpy(). You would want to use cudaMemcpyAsync() in conjunction with pinned memory. Since copy performance seems to dominate application-level performance, you would want to research the construction of a processing pipeline, e.g. with double-buffering and separate CUDA streams for copying data to / from the GPU, such that the uploads and downloads occur concurrently.
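A minimal sketch of such a pipeline might look like this (the process() kernel, buffer names, and batch count are placeholders, not your application): two streams and two pinned buffer pairs, so while one stream is computing and downloading batch i, the other can already be uploading batch i+1.

#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 20, nBatches = 8;
    cudaStream_t stream[2];
    float *h_in[2], *h_out[2], *d_in[2], *d_out[2];

    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaHostAlloc((void**)&h_in[b],  n * sizeof(float), cudaHostAllocDefault); // pinned
        cudaHostAlloc((void**)&h_out[b], n * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void**)&d_in[b],  n * sizeof(float));
        cudaMalloc((void**)&d_out[b], n * sizeof(float));
    }

    for (int i = 0; i < nBatches; ++i) {
        int b = i % 2;                        // ping-pong between the two buffer pairs
        cudaStreamSynchronize(stream[b]);     // wait until this buffer pair is free again
        // ... fill h_in[b] with batch i ...
        cudaMemcpyAsync(d_in[b], h_in[b], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out[b], d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();                  // drain both streams

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_in[b]); cudaFreeHost(h_out[b]);
        cudaFree(d_in[b]);     cudaFree(d_out[b]);
        cudaStreamDestroy(stream[b]);
    }
    return 0;
}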
The performance of many CUDA API calls is dominated primarily by CPU single-thread performance, and secondarily by system memory performance. For x86-64 platforms, single-thread performance strongly correlates with processor clock frequency.
I will first try to see how much the CUDA API calls are impacted by different clock frequencies. In the screenshot of the previous post, the rows for “48.1% HtoD memcpy” and “51.9% DtoH memcpy” are shorter than the CUDA API rows, so PCIe 4 x16 should be enough for the current application.
Regarding cudaMemcpyAsync() for a pipeline, is there a recommended way to do this when using a cuBLAS function?
In the current application, the matrix-vector multiplication y = Ax is calculated, and the call sequence is:
cudaMemcpy() //copy x from Host to Device
cublasSgemv() // calculate Ax (A is copied in advance)
cudaMemcpy() // copy y from Device to Host
As far as I understand, the data array needs to be divided into chunks if multiple streams are utilized, as described in the link below. What I can think of now is dividing x, y, and A into chunks and repeating cudaMemcpyAsync for x_chunk, cublasSgemv() for A_chunk * x_chunk, and cudaMemcpyAsync for y_chunk, finally merging the y_chunks to obtain the overall y on the host (see the sketch after the quote below).
In the modified code, we break up the array of size N into chunks of streamSize elements. Since the kernel operates independently on all elements, each of the chunks can be processed independently. https://nichijou.co/cuda9-stream2/
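Here is one possible shape I have in mind for the chunked version (a sketch only; the sizes and the row-wise split of A are assumptions, and A is column-major, m x n, with lda = m): x is small, so it is uploaded once, then each stream runs its Sgemv on a block of rows of A and copies its slice of y back while the other streams are still computing.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 4096, NCHUNK = 4, rows = m / NCHUNK;
    const float alpha = 1.0f, beta = 0.0f;

    float *h_x, *h_y, *d_A, *d_x, *d_y;
    cudaHostAlloc((void**)&h_x, n * sizeof(float), cudaHostAllocDefault);   // pinned
    cudaHostAlloc((void**)&h_y, m * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_A, (size_t)m * n * sizeof(float));
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, m * sizeof(float));
    // ... copy A to d_A once, fill h_x ...

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaStream_t stream[NCHUNK];
    for (int c = 0; c < NCHUNK; ++c) cudaStreamCreate(&stream[c]);

    // x is small, so upload it once on stream 0 and make the other streams wait on it.
    cudaEvent_t xReady;
    cudaEventCreate(&xReady);
    cudaMemcpyAsync(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice, stream[0]);
    cudaEventRecord(xReady, stream[0]);

    for (int c = 0; c < NCHUNK; ++c) {
        cudaStreamWaitEvent(stream[c], xReady, 0);
        cublasSetStream(handle, stream[c]);
        // Row block c of A starts at d_A + c*rows; the leading dimension stays m.
        cublasSgemv(handle, CUBLAS_OP_N, rows, n, &alpha,
                    d_A + c * rows, m, d_x, 1, &beta, d_y + c * rows, 1);
        cudaMemcpyAsync(h_y + c * rows, d_y + c * rows, rows * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();   // h_y now holds the full y

    // ... free buffers, destroy streams and events ...
    cublasDestroy(handle);
    return 0;
}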
The sequence is like below.
A is copied prior to the while loop, as it is constant.
cudaMemcpy()              // copy A from Host to Device
while ()
{
    if (x is updated)     // x can be updated by another process outside the while loop
    {
        cudaMemcpy()      // copy x from Host to Device
        cublasSgemv()     // calculate A*x on Device
        cudaMemcpy()      // copy y from Device to Host
    }
}
Multiplying multiple separate vectors by the same A matrix is equivalent to a matrix-matrix multiply. So one possible optimization, if you can collect some vectors together, is to copy them together as a matrix, multiply that matrix by A, then copy the resultant matrix back.
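As a sketch of that idea (sizes and names are hypothetical): collect k vectors as the columns of an n x k matrix X, then a single Sgemm computes Y = A * X in place of k separate Sgemv calls and 2k small copies.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 4096, k = 16;     // k collected vectors
    const float alpha = 1.0f, beta = 0.0f;

    float *d_A, *d_X, *d_Y;
    cudaMalloc((void**)&d_A, (size_t)m * n * sizeof(float));
    cudaMalloc((void**)&d_X, (size_t)n * k * sizeof(float));  // columns are the x vectors
    cudaMalloc((void**)&d_Y, (size_t)m * k * sizeof(float));  // columns are the y results
    // ... copy A once, copy the k x-vectors into d_X as one transfer ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Y (m x k) = A (m x n) * X (n x k), all column-major
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, k, n,
                &alpha, d_A, m, d_X, n, &beta, d_Y, m);

    // ... copy d_Y back as one transfer ...
    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;
}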