Bug in cudaMemsetAsync or in Nsight VS Edition when visualizing cudaMemsetAsync execution

When using a single cuda stream, looking at the execution with NSIGHT Visual Studio Edition, the execution of cudaMemsetAsync does not respect the order in which the command was enqueued. It always executes during the cuda runtime call, and the rest of commands (memory copies, kernels, etc) execute later.

I provide a code that reproduces the effect:

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

typedef unsigned int uint;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code,
                      const char *file,
                      int line,
                      bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);

__global__ void kernel_setAZero(uint* data) {
    *data = 0u;

void launch_setAZero(uint* data, cudaStream_t stream) {
    kernel_setAZero <<<1, 1, 0, stream>>>(data);

int main() {
    uint *data;
    gpuErrchk(cudaMalloc(&data, sizeof(uint)));

    cudaStream_t stream;

    launch_setAZero(data, stream);
    gpuErrchk(cudaMemsetAsync(data, 0, sizeof(uint), stream));


    std::cout << "Executed!!" << std::endl;

    return 0;

As you can see in the code, using a single stream first the kernel “kernel_setAZero” is enqueued in the stream, and later is the cudaMemsetAsync, on the same stream.

If you use NSight Visual Studio to look at it, you will see that the execution order visualized is the inverse of the expected.

Additionally, using the cudaMemsetAsync prevents the overlapping of transfers and computation under certain conditions, but I’m not sure if this is expected behavior on Windows.

System configuration:

widnows 10 pro 17334.286
vs2017 update 8 (compiler version 14.15)
Quadro P4000 (also tested on Quadro M4000 and Quadro P6000)
Tested both on CUDA 9.1 (WDDM driver 391.03) and CUDA 10.0 (WDDM driver 411.63)
NSight Visual Studio edition 5.5 for CUDA 9.1
NSight Visual Studio edition 6.0 for CUDA 10.0

We implemented our own version of cudaMemsetAsync using different kernel versions, according to the amount of elements to write, and using large vector data types to make it faster.

We even support writing vertor type elements, with different values on each vector component, so we are not limited to writting bytes.

Individual speeds, comparing with native cudaMemsetAsync are quite similar, sometimes equal, sometimes a bit slower, sometimes a bit faster.

But most importantly, we noticed the bug reported (on Windows 10) when the number of bytes to write is not very big (I can’t give a number, I didn’t do the search). Then, the behavior is exactly as described on the previous post.

This was causing, in certain configurations of our AI based video content production software, up to 20% of execution time lost per frame.

Now that the memset is a CUDA kernel, it also nicely overlaps with other kernels, making it almost free for small arrays, when before, writing a single byte was making us to lose 20% of execution time.

I can’t provide the code, but just for NVIDIA to know that cudaMemsetAsync seem’s to have a bug, and/or could be much better implemented, at least on Windows.

And yes, you can ask “why would yo need to do a memset?”, but that’s not the question. My AI programers want to use it, and it should properly work. Let them code, and make CUDA perform for them.

I ran your code using nvprof and observed the expected execution order, first the kernel, then the memset operation after the kernel completes.

Then, the bug is in Nsight Visual Studio Edition. Have you tested that?

What about the other side effect of cudaMallocAsync? Is it expected that a cudaMallocAsync prevents memory transfers and computation to overlap?

Do you need screenshots of nsight (wrongly) showing the inverse execution order of this code?

Do you need me to provide code that overlaps transfers and computation, and another that the only change is to include a cudaMemsetAsync, and then the overlapping does not happen?


I would need for you to tell me exactly, step-by-step, the process you followed in Nsight VSE.
Alternatively, it may be simpler for you to just file a bug at developer.nvidia.com

If you can reproduce my observation about running nvprof from the command line showing correct order of execution, then it may be a defect in Nsight VSE.

I assume you’re talking about cudaMemsetAsync, there is no such thing as cudaMallocAsync

I’m not aware of issues. However I rarely try to make informed decisions about overlap behavior in a WDDM setup.


I know it’s been a loooong time, but I’m hitting the issue with cudaMemsetAsync and overlapping of memory transfers and kernels again.

This time the issue is quite severe. A pair of cudaMemsetAsync calls, apparently internal to cuDNN (would like confirmation about that), is delaying some cudaMemcpyAsync copies (H2D and D2H always with pinned memory), for 10ms. We have a real time application where each iteration must last 33ms at most. So this delay it’s basically breaking the whole application :-D

So, first, could someone confirm that I understood the NVIDIA documentation correctly?
It says that cudaMemset’s (does not specify wether async or not, I’m assuming both), will prevent the parallel execution of two cuda calls in two different streams, if the memset is in between them.

The concept “in between them” I’m assuming means that there is some sort of relative order in the commands stored in each stream. Is that right?

I did not reply to your answer, because we implemented our own memset kernels, with different implementations for different data sizes and memory alignments, with comparable performance to the nvidia ones, and without the overlapping problem.

The issue here, is that the cuda memsets seem to be called from inside the cuDNN library, so we can not change that. Could any one confirm that cuDNN does call cudaMemsetAsync sometimes?

By the way, it is still under WDDM, but this time it’s with Hardware Scheduling ON, on an RTX Ampere GPU (not sure if I’m allowed to say the model already, it’s an early seed) and also in a Quadro RTX4000.


I would ask that question on the cuDNN forum. As a general rule it seems self-evident that any particular CUDA library call may use any of the cuda runtime API calls, as it sees fit.

Regarding your other question:

I observe in at least one case, overlap of a D->H operation with a H->D operation, even though there is a cudaMemsetAsync in-between them:

$ cat t1917.cu
#include <iostream>
const size_t sz = 1048576ULL*1024;
int main(){

  int *d1, *d2, *d3;
  int *h1, *h2;

  cudaHostAlloc(&h1, sz, cudaHostAllocDefault);
  cudaHostAlloc(&h2, sz, cudaHostAllocDefault);
  cudaMalloc(&d1, sz);
  cudaMalloc(&d2, sz);
  cudaMalloc(&d3, sizeof(int));
  cudaStream_t s1, s2, s3;
  cudaMemcpyAsync(d1, h1, sz, cudaMemcpyHostToDevice, s1);
  cudaMemsetAsync(d3, 0, sizeof(int), s3);
  cudaMemcpyAsync(h2, d2, sz, cudaMemcpyDeviceToHost, s2);

$ nvcc -o t1917 t1917.cu
$ nvprof --print-gpu-trace ./t1917
==10057== NVPROF is profiling process 10057, command: ./t1917
==10057== Profiling application: ./t1917
==10057== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
1.41643s  101.48ms                    -               -         -         -         -  1.0000GB  9.8538GB/s      Pinned      Device  Tesla V100-PCIE         1        16  [CUDA memcpy HtoD]
1.41647s  6.5920us                    -               -         -         -         -        4B  592.57KB/s      Device           -  Tesla V100-PCIE         1        18  [CUDA memset]
1.41648s  101.71ms                    -               -         -         -         -  1.0000GB  9.8319GB/s      Device      Pinned  Tesla V100-PCIE         1        17  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

CUDA 11.4, CentOS 7, Tesla V100

Judging from the nvprof output, and arbitrarily declaring time t=1.416000 to instead be time t=0, then I observe that:

  1. A cudaMemcpyAsync H->D operation is issued first. It begins at the 430 microsecond mark and has a duration of about 100 milliseconds.
  2. A cudaMemsetAsync operation is issued next. It begins at the 470 microsecond mark, and has a duration of about 7 microseconds. It therefore is overlapped with the operation in item 1 above.
  3. A cudaMemcpyAsyncD->H operation is issued next. It begins at the 480 microsecond mark, and it has a duration of about 100 milliseconds. It therefore is overlapped with the operation in item 1 above. The presence of the intervening cudaMemsetAsync has not resulted in a “delayed start” (in my opinion) and has not resulted in the inability of the operation to overlap other previous operations.

I don’t think it is that relevant, but I attribute the 10-40 microsecond differences in start times to be ordinary asynchronous overhead.

It’s entirely possible the behavior is different under WDDM. Furthermore, I generally don’t try to characterise WDDM operation in this way, because of the possibility of WDDM command batching, which has been covered in a number of other posts on these and other forums.

Neither the operating systems supported by CUDA, nor the CUDA software layers, nor the libraries associated with CUDA make any sort of promises regarding real-time capability. There are no guaranteed upper time bounds on operations, it is all best effort only.

What you observe may be due to a software bug of some sort, but trying to build what sounds like an app with hard real-time requirements with CUDA is not advisable. CUDA on GPUs can accelerate soft real-time applications that degrade gracefully (e.g. by dropping frames or lowering resolution) when deadlines are missed and are generally not critical (not life-sustaining, not life-supporting, non-hazardous, etc) .

Hi njuffa,

We have been selling our product for 5 years, and of course it is soft real-time. We do not skip any frame, but we do have buffers and delays, to absorve fluctuations.

But the problem I’m describing here it’s not a random fluctuation. It’s consistent and very characterizable. It’s a CUDA runtime decission to not overlap some memory transfers, under certain specific conditions. More on that on the next reply, thanks.

To find out whether that decision is intentional (i.e. works as designed) or unintentional (i.e. a bug) you could file a bug report with NVIDIA.

1 Like

Hi Robert,

First of all, have you ever tried WDDM 2.7 with Hardware Scheduling ON?? It basically behaves almost like TCC. Which in turn behaves as Linux. So please, could you have a look at what I’m about to explain? It’s a very CONSISTENT and CHARACTERIZABLE problem.

We continued working on the issue, and despite that cuDNN is using cudaMemsetAsync, we found out that the issue is not that, but a pair of cudaMemcpyAsync(D2H) in OpenCV. Fortunately, OpenCV is open source, and we already fixed it.

We found out because we created a “fake” inference function, that recreates the same cuda launches that OpenCV+cuDNN are doing. Similar number of kernels (dummy kernels in this case) and same cudaMemsetAsync and cudaMemcpyAsync calls, with the same streams with the same flags, and the same events with the same flags.

The issue only stopped happening when we removed the cudaMemcpyAsync.

Let me show you, with a use case that works despite the issue, just because it’s easier to visualize the problem and explain it.


As you can see in the timeline above, the second and fourth pack of PCIe transfers do not happen until all the kernels finish their execution. Actually, those transfers are all executed in the same stream (overlapping uploads and downloads result in worse overall transfer times in our systems, probably due to CPU RAM R/W bandwidth limitations) and are completely independent of the kernels, which execute in a different stream than the transfers, and the two streams, kernels and transfers, are not synchronized with any event.

Therefore, here the CUDA runtime decides to UNECESSARILY delay those transfers.

Here is the fixed version:


As you can see in this fixed version, the second and fourth “transfer packs”, now overlap with the kernels. Each “transfer pack” corresponds to an iteration in our software, so we could say we are seeing T, T+1, T+2, T+3. The Deep Learning part, it’s 70-80% of the kernels, and it’s only triggered to execute on T and T+2.

Now, in this second timeline we can see that after the DL kernels finish, there are some small D2H transfers. Those are the transfers that were causing the problem.

And the reason is the following:

In that timeline, we can see the detail of the streams being used, the kernel calls, and the memory transfers.

We are visualizing T and T+1, and one entire DL execution.

We can also see 2 CPU threads. One of them is executing the CUDA calls for T and T+1, and the other one the DL execution for T.

In Blue lines we circled both the CPU calls and the corresponding GPU executions for T, in red the DL and in black T+1.

As you can see, the calls for the DL and T are happening more or less at the same time, and after that, we ask for T+1.

In Yellow we circled two small D2H transfers, that are the ones making the transfers for T+1 to wait, and not overlap. It seems that the runtime CONSISTENTLY decides to make all transfers that are enqueued in any stream, to wait for the transfers that were previously enqueued in another stream.

Here is the detail for the fixed version:

So, in the fixed version, we simply modified the OpenCV code, so that the downloads are enqueued from the Host code, only after waiting for the execution of all the DL kernels. This way, even if some transfers had to wait for the DL transfers to finish, they would have to wait only for the transfers, and not for the kernels + transfers. As you can see, the T+1 transfers now overlap nicely with the rest of the kernels.

Question now: the decision that the CUDA runtime makes to order the PCIe memory transfers according to FIFO instead of using the dependency information in the streams, is an expected behavior?


By the way, in the use case where we were having the biggest performance problems, we were loosing 10 milli seconds, not micro seconds.

ms = milliseconds
us = microseconds
1 ms = 1000 us


I don’t really know. If there is a gap in activity on a transfer path, I would generally expect the GPU to schedule a pending transfer in that area, if possible. If the GPU has two pending D->H transfers, for example, and one is ready to schedule and the other is not, I don’t know why it would arbitrarily wait.

We used to have such dependencies (on issue order) in the hardware (back in the Fermi days) but that should have been gradually ironed out starting with Kepler.

However its entirely possible there is something here I don’t know, and since we’re discussing generalities without specific test cases (at least: I don’t have your test case), I may have missed an important detail. There’s probably not much point in giving me a complex test case that has lots of dependencies on OpenCV, cudnn, and other things. I generally don’t have the time to work on things that are that involved.

Probably best to come up with the shortest possible complete test case, and file a bug.

If you don’t provide it originally, you would almost certainly be asked for a short, complete test case, with a full description of how to observe the issue. My advice again would be if possible, make that test case just depend on CUDA itself, not OpenCV, not cuDNN (if possible) and not anything else.

Thank you Robert, I can make a test case that only depends on CUDA. I already have half of it.