I would like to know the cause of, and a solution for, the following cudaMemcpy latency behavior.
The latency increases when 2 processes on the host (CPU) use 1 device (GPU), compared to when 1 process uses it.
Overview:
Situation (A): latency for cudaMemcpy is high (>100 us) when 2 processes use 1 device.
Situation (B): latency for cudaMemcpy is small (~10 us) when 1 process uses 1 device.
What could cause this behavior and how to mitigate it?
I assume this is not related to the data copy size, because the data copied here is small (32 bytes).
When multiple threads or processes use the CUDA runtime API, there is potentially an internal competition for locks which may increase the latency of a runtime API call. There are numerous reports of this on these forums, and the possibility is documented.
When 2 (or more) processes are using the same GPU (without MPS), there may be a necessity for a context-switch in between the execution of work items associated with the first process and work items associated with the second process.
Mitigation for the first item is not a trivial matter. It might require refactoring your code to issue work from a single thread/process.
A possible mitigation avenue for the second possibility could be the use of MPS. To be clear, I’m not suggesting that if you use MPS suddenly the problem will disappear; it may have no effect at all. It’s just a possibility. If it were my code, I might try using MPS to learn and see what the impact is, if any.
cudaMemcpy is not an asynchronous operation, normally. Since the GPU is an asynchronous, latency-hiding machine, another general design principle would be to issue as much work as possible via async operations.
Also, since cudaMemcpy is blocking/synchronizing in typical usage, another possible cause of increased latency in a multiple-client case is simply that the cudaMemcpy is by definition waiting for the completion of work, and there is more work queued up to the GPU in the multiple client case.
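As a rough illustration (not your code; the buffer names and the 32-byte size are just placeholders), the blocking copy could be replaced with an asynchronous copy from pinned memory on a dedicated stream, so the host thread is not forced to wait on unrelated queued GPU work:

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32;
    float *h_x = nullptr, *d_x = nullptr;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaHostAlloc((void**)&h_x, bytes, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void**)&d_x, bytes);

    // ... fill h_x ...

    // Asynchronous copy: control returns to the host immediately; the work is
    // ordered only with respect to other operations in 'stream'.
    cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);

    // Synchronize only when the result is actually needed.
    cudaStreamSynchronize(stream);

    cudaFree(d_x);
    cudaFreeHost(h_x);
    cudaStreamDestroy(stream);
    return 0;
}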
Simple question, but the duration of CUDA API calls depends on the CPU/RAM spec and is independent of the GPU/PCIe spec, correct? (except for some portion of the third cudaMemcpy, as it first waits for the kernel execution to complete)
Somehow, neither pinned memory nor zero copy (cudaHostAllocMapped) improved the speed in my environment.
So, if I keep cudaMemcpy as it is now, I am thinking about upgrading the CPU and RAM.
That should reduce overhead to some extent, for example the copy from pageable (virtual) memory to the staging buffer inside cudaMemcpy, or the processing needed to launch a kernel.
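For reference, this is roughly how I set up the zero-copy path (a minimal sketch with hypothetical names; the scale kernel is just a stand-in for my real work):

#include <cuda_runtime.h>

__global__ void scale(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

int main() {
    const int n = 8;                           // 8 floats = 32 bytes
    float *h_x = nullptr, *d_x = nullptr, *d_y = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);     // enable mapped pinned memory

    // Mapped, pinned host allocation: the kernel reads it directly over PCIe,
    // so there is no explicit H2D cudaMemcpy, but every access crosses the bus.
    cudaHostAlloc((void**)&h_x, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_x, h_x, 0);   // device-visible alias of h_x
    cudaMalloc((void**)&d_y, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_x[i] = float(i);

    scale<<<1, n>>>(d_x, d_y, 2.0f, n);        // no explicit host-to-device copy
    cudaDeviceSynchronize();

    cudaFree(d_y);
    cudaFreeHost(h_x);
    return 0;
}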
You might want to consider compressing the data being copied. Sometimes a simple switch to a more efficient data type will help, subject to accuracy constraints. More elaborate compression schemes (some domain specific) to be used with GPUs have been published.
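As a hedged illustration of the data-type idea (assuming FP16 accuracy is acceptable for your data; the widen kernel and sizes below are made up), the host can convert to half precision, copy half as many bytes, and widen back to float on the device:

#include <cuda_runtime.h>
#include <cuda_fp16.h>

__global__ void widen(const __half *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __half2float(in[i]);
}

int main() {
    const int n = 1 << 20;
    float  *h_f = new float[n];
    __half *h_h = new __half[n];
    for (int i = 0; i < n; ++i) { h_f[i] = float(i); h_h[i] = __float2half(h_f[i]); }

    __half *d_h; float *d_f;
    cudaMalloc((void**)&d_h, n * sizeof(__half));
    cudaMalloc((void**)&d_f, n * sizeof(float));

    cudaMemcpy(d_h, h_h, n * sizeof(__half), cudaMemcpyHostToDevice);  // half the bytes
    widen<<<(n + 255) / 256, 256>>>(d_h, d_f, n);                      // restore float on device
    cudaDeviceSynchronize();

    cudaFree(d_f); cudaFree(d_h);
    delete[] h_h; delete[] h_f;
    return 0;
}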
That’s an oversimplification, I would say. For example, the PCIe generation supported by a GPU can impact host->device copy performance. At least for a transitional time that is ending now with Blackwell shipping, many GPUs with PCIe 4 were used on platforms capable of PCIe 5. A more accurate characterization is IMHO: The performance of many CUDA API calls is dominated primarily by CPU single-thread performance, and secondarily by system memory performance. For x86-64 platforms, single-thread performance strongly correlates with processor clock frequency.
Your profiler output shows a call to cudaMemcpy(). You would want to use cudaMemcpyAsync() in conjunction with pinned memory. Since copy performance seems to dominate application-level performance, you would want to research the construction of a processing pipeline, e.g. with double-buffering and separate CUDA streams for copying data to / from the GPU, such that the uploads and downloads occur concurrently.
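A minimal sketch of such a pipeline might look like this (the process() kernel, buffer names, and batch count are placeholders, not your application): two streams and two pinned buffer pairs, so while one stream is computing and downloading batch i, the other can already be uploading batch i+1.

#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 20, nBatches = 8;
    cudaStream_t stream[2];
    float *h_in[2], *h_out[2], *d_in[2], *d_out[2];

    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaHostAlloc((void**)&h_in[b],  n * sizeof(float), cudaHostAllocDefault); // pinned
        cudaHostAlloc((void**)&h_out[b], n * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void**)&d_in[b],  n * sizeof(float));
        cudaMalloc((void**)&d_out[b], n * sizeof(float));
    }

    for (int i = 0; i < nBatches; ++i) {
        int b = i % 2;                        // ping-pong between the two buffer pairs
        cudaStreamSynchronize(stream[b]);     // wait until this buffer pair is free again
        // ... fill h_in[b] with batch i ...
        cudaMemcpyAsync(d_in[b], h_in[b], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out[b], d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();                  // drain both streams

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_in[b]); cudaFreeHost(h_out[b]);
        cudaFree(d_in[b]);     cudaFree(d_out[b]);
        cudaStreamDestroy(stream[b]);
    }
    return 0;
}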
The performance of many CUDA API calls is dominated primarily by CPU single-thread performance, and secondarily by system memory performance. For x86-64 platforms, single-thread performance strongly correlates with processor clock frequency.
I will first try to see how much the CUDA API calls are impacted by different clock frequencies. In the screenshot of the previous post, the rows for “48.1% HtoD memcpy” and “51.9% DtoH memcpy” are shorter than the CUDA API rows, so PCIe 4 x16 should be enough for the current application.
Regarding cudaMemcpyAsync() for a pipeline, is there a recommended way to do this when using a cuBLAS function?
In the current application, the matrix-vector multiplication y = Ax is calculated, and the call sequence is:
cudaMemcpy() //copy x from Host to Device
cublasSgemv() // calculate Ax (A is copied in advance)
cudaMemcpy() // copy y from Device to Host
As far as I understand, the data array needs to be divided into chunks if multiple streams are utilized, as described in the link below. What I can think of now is dividing x, y, and A into chunks and repeating cudaMemcpyAsync for x_chunk, cublasSgemv() for A_chunk * x_chunk, and cudaMemcpyAsync for y_chunk, finally merging the y_chunks to obtain the overall y on the host (see the sketch after the quote below).
In the modified code, we break up the array of size N into chunks of streamSize elements. Since the kernel operates independently on all elements, each of the chunks can be processed independently. https://nichijou.co/cuda9-stream2/
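Here is one possible shape I have in mind for the chunked version (a sketch only; the sizes and the row-wise split of A are assumptions, and A is column-major, m x n, with lda = m): x is small, so it is uploaded once, then each stream runs its Sgemv on a block of rows of A and copies its slice of y back while the other streams are still computing.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 4096, NCHUNK = 4, rows = m / NCHUNK;
    const float alpha = 1.0f, beta = 0.0f;

    float *h_x, *h_y, *d_A, *d_x, *d_y;
    cudaHostAlloc((void**)&h_x, n * sizeof(float), cudaHostAllocDefault);   // pinned
    cudaHostAlloc((void**)&h_y, m * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_A, (size_t)m * n * sizeof(float));
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, m * sizeof(float));
    // ... copy A to d_A once, fill h_x ...

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaStream_t stream[NCHUNK];
    for (int c = 0; c < NCHUNK; ++c) cudaStreamCreate(&stream[c]);

    // x is small, so upload it once on stream 0 and make the other streams wait on it.
    cudaEvent_t xReady;
    cudaEventCreate(&xReady);
    cudaMemcpyAsync(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice, stream[0]);
    cudaEventRecord(xReady, stream[0]);

    for (int c = 0; c < NCHUNK; ++c) {
        cudaStreamWaitEvent(stream[c], xReady, 0);
        cublasSetStream(handle, stream[c]);
        // Row block c of A starts at d_A + c*rows; the leading dimension stays m.
        cublasSgemv(handle, CUBLAS_OP_N, rows, n, &alpha,
                    d_A + c * rows, m, d_x, 1, &beta, d_y + c * rows, 1);
        cudaMemcpyAsync(h_y + c * rows, d_y + c * rows, rows * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();   // h_y now holds the full y

    // ... free buffers, destroy streams and events ...
    cublasDestroy(handle);
    return 0;
}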
The sequence is like below.
A is copied prior to the while loop, as it is constant.
cudaMemcpy()              // copy A from Host to Device
while ()
{
    if (x is updated)     // x can be updated by another process outside the while loop
    {
        cudaMemcpy()      // copy x from Host to Device
        cublasSgemv()     // calculate A*x on Device
        cudaMemcpy()      // copy y from Device to Host
    }
}
Multiplying multiple separate vectors by the same A matrix is equivalent to a matrix-matrix multiply. So one possible optimization, if you can collect some vectors together, is to copy them together as a matrix, multiply that matrix by A, then copy the resultant matrix back.
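As a sketch of that idea (sizes and names are hypothetical): collect k vectors as the columns of an n x k matrix X, then a single Sgemm computes Y = A * X in place of k separate Sgemv calls and 2k small copies.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 4096, k = 16;     // k collected vectors
    const float alpha = 1.0f, beta = 0.0f;

    float *d_A, *d_X, *d_Y;
    cudaMalloc((void**)&d_A, (size_t)m * n * sizeof(float));
    cudaMalloc((void**)&d_X, (size_t)n * k * sizeof(float));  // columns are the x vectors
    cudaMalloc((void**)&d_Y, (size_t)m * k * sizeof(float));  // columns are the y results
    // ... copy A once, copy the k x-vectors into d_X as one transfer ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Y (m x k) = A (m x n) * X (n x k), all column-major
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, k, n,
                &alpha, d_A, m, d_X, n, &beta, d_Y, m);

    // ... copy d_Y back as one transfer ...
    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;
}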