cudaMemcpy2DAsync not always fully synchronous

ColinPs26kt · October 9, 2020, 9:44am

Hi,

I’m calling cudaMemcpy2DAsync like so:

cudaStream_t _gpuCopyStream;
cudaStreamCreateWithFlags(&_gpuCopyStream, cudaStreamNonBlocking);

auto t_start_check1 = std::chrono::high_resolution_clock::now();
cudaMemcpy2DAsync(…, … , …, …, …, cudaMemcpyDeviceToDevice, _gpuCopyStream);
auto t_end_check1 = std::chrono::high_resolution_clock::now();
float total_check1 = std::chrono::duration<float, std::milli>(t_end_check1 - t_start_check1).count();
printf(“Time taken to cross asynchronous function: %f ms.\n”, total_check1);

cudaStreamSynchronize(_gpuCopyStream);

When I measure the time across the cudaMemcpy2DAsync function the timing seems to spike at times (as shown in the graph below). I can understand why the time taken may spike if it’s a matter of “cudaMemcpyHostToDevice”. However in this case, it’s a “cudaMemcpyDeviceToDevice”.

Any reason for the spikes? Thanks.

Robert_Crovella · October 15, 2020, 11:17pm

Really this is just wild speculation. AFAIK device-to-device cudaMemcpy calls are usually implemented via a device kernel (under the hood). This device kernel launch has to negotiate the same kernel launch queue that other kernels go through. When the kernel launch queue is full, an “asynchronous” kernel launch becomes host-thread-blocking, until a queue spot opens up.

I believe its a possible explanation, based on what you have shown. OTOH it seems unlikely to me, because the kernel launch queue in my experience needs ~1000 or more backed up entries before it is full. So this only seems possible if your code is issuing a large amount of asynchronous work. Your cudaStreamSynchronize() would tend to empty the queue, possibly.

ColinPs26kt · November 12, 2020, 10:01am

Thanks for your reply @Robert_Crovella.

My code is implementing quite a bit of asynchronous work.

May I also clarify a similar issue on the use of the NvDecoder.

uint8_t **ppFrame;
auto start_decode = std::chrono::high_resolution_clock::now();
_decoder->Decode(pData, nSize, &ppFrame, &pnFrameReturned, 0, nullptr, 0, _decodeStream);
auto elapsed_decode = std::chrono::high_resolution_clock::now() - start_decode;
long long microseconds_decode = std::chrono::duration_caststd::chrono::microseconds(elapsed_decode).count();
double msec_decode = microseconds_decode / 1000;

Again, I have deliberately timed the function without the use of cudaStreamSychronize(_decodeStream). I would expect this time to be also close to zero. However, the timing keeps spiking like in the case of the cudaMemcpy2DAsync discussed in the original post.

I have created the stream to be non-blocking and at high priority.
int low, high;
cudaDeviceGetStreamPriorityRange(&low, &high);
cudaStreamCreateWithPriority(&_decodeStream,cudaStreamNonBlocking, high);

Is there something wrong in my implementation of the NvDecoder or memory management?

Thanks.

Robert_Crovella · November 12, 2020, 2:44pm

The function call may also be hitting some sort of “queue full” condition also, causing it to convert from async to sync. That’s just a guess. There isn’t anything here I could actually run, and anyway you may get better help with one of the forums here

njuffa · November 12, 2020, 7:31pm

Re spiking timing: This isn’t by any chance happening on a Windows system using the default WDDM driver? As far as I am aware, the CUDA driver still uses launch batching to try to get around the large overhead introduced by this driver model.

ColinPs26kt · November 13, 2020, 5:07am

Hi @njuffa,

I think my machine is set to TCC. That’s the recommended setting right?

Colin.

njuffa · November 13, 2020, 5:15am

According to the output of nvidia-smi its running with a TCC driver. And, yes, that is recommended for compute-only use when the card supports it.

I don’t have an explanation but concur with everything Robert Crovella has said in this thread so far. The spikes seem to be pretty exactly 18 frames apart, maybe that is a clue in the context of your application? Without context its not really possible to diagnose.

ColinPs26kt · December 9, 2020, 5:39am

Hi @njuffa,

Yes, every 16 to 18 frames apart. A bit more context:

I have 4 similar processes running on 1 T4 card. Each process has 2 threads. One running the decoder to grab frames and another running a batch inference. The batch inference is triggered every 16 frames for detection and classification. I run 1 detector and 2 classifiers all asynchronously (using non-default threads) when the inference is triggered.

I count a total of no more than 7 cuda streams (all nonBlocking) per process which would mean 28 cuda streams running in total. In theory, they should not interfere with each other.

My guess is that when this inference is triggered, all other processes slow down because of some limit. Could this be right? Also, is there a limit to the number of streams I can launch based on the SM count?

njuffa · December 9, 2020, 9:15am

That seems to suggest that the batch interference could be at the root of the problem, as the observed slowdown happens roughly in sync with that. Is this a fairly long-running kernel by any chance that is also very memory intensive?

While copies and compute kernels can run concurrently, both share the overall memory bandwidth of the GPU memory, so it is conceivable that a “heavy” kernel that uses most of the GPU memory bandwidth might be able to starve the copy operations; or vice versa. I do not have a good understanding on how the memory controllers work on modern GPUs, so this is certainly quite speculative and very handwave-y.

By way of experiment, would it be possible to split the interference kernel into four parts that run every four frames, in order to smooth out the load on the GPU memory system a bit? Presumably you would then see slowdown every four frames then, but of much smaller magnitude.

Robert_Crovella · December 9, 2020, 2:53pm

I’m not sure how you arrived at that conclusion. Work on a GPU frequently interferes with other work on a GPU. There are a variety of shared resources and bottlenecks.

There is no limit to the number of streams that can be used on a GPU that is anywhere close to single, double, or triple digit numbers. Just because you can create and use 500 streams, however, does not in any way indicate that the work associated with those streams will have no impact on each other.

With multiple processes, you are going to have context switching costs. I don’t happen to know what those might be, but the context switching cost could easily be hundreds of microseconds or more, and it might be coincident with the launch of a worker kernel in a stream. I wouldn’t have guessed that it could explain a 100ms delay, but since I don’t know, then I must conclude it might be possible.

You might have better luck doing this all in the scope of a single process. If it were linux, I would make other suggestions as well, such as using CUDA MPS and/or Triton Inference Server. I don’t know that any of these suggestions will fix anything, of course, but their use cases are relevant to the use case you seem to be describing.

ColinPs26kt · February 4, 2021, 6:25am

Hi @Robert_Crovella

I’ve designed a simple experiment to simulate the issue. I’ve attached the cpp file in this post. The model I used can be found in this link (https://xrvisionpteltd-my.sharepoint.com/:f:/g/personal/colinp_xrvision_ai/Eqhv4UJBxRJEk6v-wbPeGd8Bkd1IVa7sgYQvsRTj553WIg?e=D7YWnb). It’s a yolov4 model converted from darknet to onnx.

In this experiment, I run a number of inferences continuously in parallel using the enqueue function. The streams are created to be non-blocking.

for(int streamIndex = 0; streamIndex < numberOfStreams; streamIndex++)
	cudaStreamCreateWithFlags(&mTrtCudaStream_onnx[streamIndex], cudaStreamNonBlocking);

char key = 0;
while (key != 'q')
{
	auto t_start_inference = std::chrono::high_resolution_clock::now();
	for (int streamIndex = 0; streamIndex < numberOfStreams; streamIndex++)
		mPredictionContext[streamIndex]->enqueueV2(&mTrtCudaBuffer_prediction[streamIndex][0], mTrtCudaStream_onnx[streamIndex], NULL);

	for (int streamIndex = 0; streamIndex < numberOfStreams; streamIndex++)
		cudaStreamSynchronize(mTrtCudaStream_onnx[streamIndex]);

	auto t_end_inference = std::chrono::high_resolution_clock::now();
	float total_inference = std::chrono::duration<float, std::milli>(t_end_inference - t_start_inference).count();
	//std::cout << "Time taken to copy to GPU (per frame): " << total_cudaMemcpyAsync << " ms." << "\n";
	printf("Time taken for inference %f  ms.\n", total_inference);
}

I vary the number of streams between 1 to 5 and find that every time I increase the number of streams, the time taken increases linearly as though the enqueue function is not running asynchronously at all.

Number Of Streams	Time Taken For Inference (ms)
1	48.08
2	95.09
3	143.98
4	194.46
5	239.25

SimpleInference.cpp (8.9 KB)

Robert_Crovella · February 4, 2021, 3:29pm

A few comments in no particular order:

This thread started out talking about cudaMemcpyAsync. Now we are talking about an inference op?
It appears to me you’re not measuring the time of the enqueue function, but the time of the inference op.
If an inference op can saturate a GPU, why would you expect that attempting to run 2 in parallel will somehow be non linear? If you start out with an expectation that stream usage causes things to run in parallel (not a true statement. It is a necessary but not sufficient condition) then there is no reason not to conclude that the GPU has infinite capacity. Such a conclusion is obviously false, therefore one of the preceding expectations is also not supportable
If I were trying to get best efficiency from multiple inference requests, I would start with Triton. I’m not saying it would make any difference here, but that’s what I would use. It’s engineered to give maximum efficiency when multiple requests are received, perhaps asynchronously from multiple clients.

The question seems to have evolved to a point where it is now asking “why don’t 2 operations run in parallel, at the same time, in the same amount of time as one operation, when I use streams?” This is a common question. It rests on the false premise that if a GPU can run an operation X in time Y, by using streams I can run two of X in the same time Y. When distilled down to that point, I hope it is clear that such an expectation is not supportable as a general rule.

Topic		Replies	Views
Cannot get any stream parallelism. CUDA Programming and Performance	13	1233	December 31, 2019
cudaMemcpyAsync CUDA Programming and Performance	10	20160	October 16, 2015
cudaMemcpyAsync, unexpected behaviour while using cudaStreamNonBlocking? CUDA Programming and Performance	6	2005	May 29, 2018
Time intervals and non-concurrent in multi streaming CUDA Programming and Performance cuda	6	562	April 6, 2023
Streams and multiprocessor usage? CUDA Programming and Performance	3	2893	September 20, 2008
Why does cudaStreamAddCallback serialize kernel execution and break concurrency? CUDA Programming and Performance	12	7906	April 5, 2015
Cuda code performance CUDA Programming and Performance	14	3077	December 16, 2014
Memset/memcpyDtoD implicitly synchronizes all streams -- a way to disable it? CUDA Programming and Performance	5	524	August 23, 2023
Unpredictable nature of GPU action timing in Nsight CUDA Programming and Performance	10	1202	October 18, 2015
The Cuda 5 Second execution-time limit Finding a the way to work around the GDI timeout CUDA Programming and Performance	24	12680	July 26, 2010

cudaMemcpy2DAsync not always fully synchronous

Related topics