cudaMemcpy2DAsync not always fully synchronous


I’m calling cudaMemcpy2DAsync like so:

cudaStream_t _gpuCopyStream;
cudaStreamCreateWithFlags(&_gpuCopyStream, cudaStreamNonBlocking);

auto t_start_check1 = std::chrono::high_resolution_clock::now();
cudaMemcpy2DAsync(…, … , …, …, …, cudaMemcpyDeviceToDevice, _gpuCopyStream);
auto t_end_check1 = std::chrono::high_resolution_clock::now();
float total_check1 = std::chrono::duration<float, std::milli>(t_end_check1 - t_start_check1).count();
printf(“Time taken to cross asynchronous function: %f ms.\n”, total_check1);


When I measure the time across the cudaMemcpy2DAsync function the timing seems to spike at times (as shown in the graph below). I can understand why the time taken may spike if it’s a matter of “cudaMemcpyHostToDevice”. However in this case, it’s a “cudaMemcpyDeviceToDevice”.

Any reason for the spikes? Thanks.

Really this is just wild speculation. AFAIK device-to-device cudaMemcpy calls are usually implemented via a device kernel (under the hood). This device kernel launch has to negotiate the same kernel launch queue that other kernels go through. When the kernel launch queue is full, an “asynchronous” kernel launch becomes host-thread-blocking, until a queue spot opens up.

I believe its a possible explanation, based on what you have shown. OTOH it seems unlikely to me, because the kernel launch queue in my experience needs ~1000 or more backed up entries before it is full. So this only seems possible if your code is issuing a large amount of asynchronous work. Your cudaStreamSynchronize() would tend to empty the queue, possibly.

Thanks for your reply @Robert_Crovella.

My code is implementing quite a bit of asynchronous work.

May I also clarify a similar issue on the use of the NvDecoder.

uint8_t **ppFrame;
auto start_decode = std::chrono::high_resolution_clock::now();
_decoder->Decode(pData, nSize, &ppFrame, &pnFrameReturned, 0, nullptr, 0, _decodeStream);
auto elapsed_decode = std::chrono::high_resolution_clock::now() - start_decode;
long long microseconds_decode = std::chrono::duration_caststd::chrono::microseconds(elapsed_decode).count();
double msec_decode = microseconds_decode / 1000;

Again, I have deliberately timed the function without the use of cudaStreamSychronize(_decodeStream). I would expect this time to be also close to zero. However, the timing keeps spiking like in the case of the cudaMemcpy2DAsync discussed in the original post.


I have created the stream to be non-blocking and at high priority.
int low, high;
cudaDeviceGetStreamPriorityRange(&low, &high);
cudaStreamCreateWithPriority(&_decodeStream,cudaStreamNonBlocking, high);

Is there something wrong in my implementation of the NvDecoder or memory management?


The function call may also be hitting some sort of “queue full” condition also, causing it to convert from async to sync. That’s just a guess. There isn’t anything here I could actually run, and anyway you may get better help with one of the forums here

Re spiking timing: This isn’t by any chance happening on a Windows system using the default WDDM driver? As far as I am aware, the CUDA driver still uses launch batching to try to get around the large overhead introduced by this driver model.

Hi @njuffa,

I think my machine is set to TCC. That’s the recommended setting right?


According to the output of nvidia-smi its running with a TCC driver. And, yes, that is recommended for compute-only use when the card supports it.

I don’t have an explanation but concur with everything Robert Crovella has said in this thread so far. The spikes seem to be pretty exactly 18 frames apart, maybe that is a clue in the context of your application? Without context its not really possible to diagnose.

Hi @njuffa,

Yes, every 16 to 18 frames apart. A bit more context:

I have 4 similar processes running on 1 T4 card. Each process has 2 threads. One running the decoder to grab frames and another running a batch inference. The batch inference is triggered every 16 frames for detection and classification. I run 1 detector and 2 classifiers all asynchronously (using non-default threads) when the inference is triggered.

I count a total of no more than 7 cuda streams (all nonBlocking) per process which would mean 28 cuda streams running in total. In theory, they should not interfere with each other.

My guess is that when this inference is triggered, all other processes slow down because of some limit. Could this be right? Also, is there a limit to the number of streams I can launch based on the SM count?

That seems to suggest that the batch interference could be at the root of the problem, as the observed slowdown happens roughly in sync with that. Is this a fairly long-running kernel by any chance that is also very memory intensive?

While copies and compute kernels can run concurrently, both share the overall memory bandwidth of the GPU memory, so it is conceivable that a “heavy” kernel that uses most of the GPU memory bandwidth might be able to starve the copy operations; or vice versa. I do not have a good understanding on how the memory controllers work on modern GPUs, so this is certainly quite speculative and very handwave-y.

By way of experiment, would it be possible to split the interference kernel into four parts that run every four frames, in order to smooth out the load on the GPU memory system a bit? Presumably you would then see slowdown every four frames then, but of much smaller magnitude.

I’m not sure how you arrived at that conclusion. Work on a GPU frequently interferes with other work on a GPU. There are a variety of shared resources and bottlenecks.

There is no limit to the number of streams that can be used on a GPU that is anywhere close to single, double, or triple digit numbers. Just because you can create and use 500 streams, however, does not in any way indicate that the work associated with those streams will have no impact on each other.

With multiple processes, you are going to have context switching costs. I don’t happen to know what those might be, but the context switching cost could easily be hundreds of microseconds or more, and it might be coincident with the launch of a worker kernel in a stream. I wouldn’t have guessed that it could explain a 100ms delay, but since I don’t know, then I must conclude it might be possible.

You might have better luck doing this all in the scope of a single process. If it were linux, I would make other suggestions as well, such as using CUDA MPS and/or Triton Inference Server. I don’t know that any of these suggestions will fix anything, of course, but their use cases are relevant to the use case you seem to be describing.

Hi @Robert_Crovella

I’ve designed a simple experiment to simulate the issue. I’ve attached the cpp file in this post. The model I used can be found in this link ( It’s a yolov4 model converted from darknet to onnx.

In this experiment, I run a number of inferences continuously in parallel using the enqueue function. The streams are created to be non-blocking.

for(int streamIndex = 0; streamIndex < numberOfStreams; streamIndex++)
	cudaStreamCreateWithFlags(&mTrtCudaStream_onnx[streamIndex], cudaStreamNonBlocking);

char key = 0;
while (key != 'q')
	auto t_start_inference = std::chrono::high_resolution_clock::now();
	for (int streamIndex = 0; streamIndex < numberOfStreams; streamIndex++)
		mPredictionContext[streamIndex]->enqueueV2(&mTrtCudaBuffer_prediction[streamIndex][0], mTrtCudaStream_onnx[streamIndex], NULL);

	for (int streamIndex = 0; streamIndex < numberOfStreams; streamIndex++)

	auto t_end_inference = std::chrono::high_resolution_clock::now();
	float total_inference = std::chrono::duration<float, std::milli>(t_end_inference - t_start_inference).count();
	//std::cout << "Time taken to copy to GPU (per frame): " << total_cudaMemcpyAsync << " ms." << "\n";
	printf("Time taken for inference %f  ms.\n", total_inference);

I vary the number of streams between 1 to 5 and find that every time I increase the number of streams, the time taken increases linearly as though the enqueue function is not running asynchronously at all.

Number Of Streams Time Taken For Inference (ms)
1 48.08
2 95.09
3 143.98
4 194.46
5 239.25


SimpleInference.cpp (8.9 KB)

A few comments in no particular order:

  • This thread started out talking about cudaMemcpyAsync. Now we are talking about an inference op?
  • It appears to me you’re not measuring the time of the enqueue function, but the time of the inference op.
  • If an inference op can saturate a GPU, why would you expect that attempting to run 2 in parallel will somehow be non linear? If you start out with an expectation that stream usage causes things to run in parallel (not a true statement. It is a necessary but not sufficient condition) then there is no reason not to conclude that the GPU has infinite capacity. Such a conclusion is obviously false, therefore one of the preceding expectations is also not supportable
  • If I were trying to get best efficiency from multiple inference requests, I would start with Triton. I’m not saying it would make any difference here, but that’s what I would use. It’s engineered to give maximum efficiency when multiple requests are received, perhaps asynchronously from multiple clients.

The question seems to have evolved to a point where it is now asking “why don’t 2 operations run in parallel, at the same time, in the same amount of time as one operation, when I use streams?” This is a common question. It rests on the false premise that if a GPU can run an operation X in time Y, by using streams I can run two of X in the same time Y. When distilled down to that point, I hope it is clear that such an expectation is not supportable as a general rule.