Understanding the output of Nsight Systems (CUDA API row vs. rows in CUDA HW section)

Hello, everyone

I have a CUDA application that I trace with Nsight Systems (GPU: GeForce RTX 2050, OS: Windows 11, CUDA 12.4). I watched some tutorials about the tool, and from what I understand, the CUDA API row shows the time the host spends in each API call or kernel launch.
When we use a synchronous API like cudaMemcpy, we expect the call not to return until the corresponding work has finished on the GPU. However, what I see on the Nsight Systems timeline isn't consistent with this:

What I see here is that the host calls cudaMemcpy and this call takes some time; then, before the API even starts working on the GPU, the host returns and issues the next API call, which is another cudaMemcpy. Is my understanding correct? If so, how is it possible that a synchronous API call returns before its work has executed on the GPU?

Thank you.

Please read the CUDA Runtime API documentation on API synchronization behavior.

The API provides memcpy/memset functions in both synchronous and asynchronous forms, the latter having an “Async” suffix. This is a misnomer as each function may exhibit synchronous or asynchronous behavior depending on the arguments passed to the function.
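For illustration, here is a minimal sketch (buffer names and the 1 MiB size are just placeholders) contrasting a copy from pageable memory with an Async copy from pinned memory; the source memory and the function variant together determine how much the call blocks the host:

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t bytes = 1 << 20;   // 1 MiB, illustrative size
    float* d_buf;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory: ordinary malloc/new/std::vector storage.
    float* h_pageable = static_cast<float*>(std::malloc(bytes));

    // Page-locked (pinned) host memory: eligible for true asynchronous DMA.
    float* h_pinned;
    cudaMallocHost(&h_pinned, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pageable source: cudaMemcpy may return once the data has been staged
    // for transfer; the DMA to the device can still be in flight.
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned source with the Async variant: the call returns immediately;
    // synchronize before the host reuses or frees h_pinned.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    std::free(h_pageable);
    cudaFree(d_buf);
    return 0;
}

The case in your trace is the first one: a plain cudaMemcpy from pageable memory.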

I read the link you provided. However, I am doing an H2D cudaMemcpy (from pageable host memory), and what I see in the timeline is asynchronous behavior! Also, quoting from the link:

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

Even if that were the case, the cudaMemcpy shown in the CUDA API row and the copy shown in the Memory row of the CUDA HW section should overlap at least partially, but they don't. So, unfortunately, I still do not understand.

The screenshot does not convey much information. Small copies (tens of KiB) can be done through a different mechanism than the asynchronous copy engine, and this route can be asynchronous. The best approach is to file a bug against the CUDA driver/documentation requesting an improvement to the documentation. Please include a minimal reproducible example.
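If you want to check whether the pageable/staging path is what you are seeing, one experiment (a sketch with illustrative names, not something I have profiled on your setup) is to page-lock the host buffer with cudaHostRegister; for a pinned source, cudaMemcpy is synchronous with respect to the host, so the API-row interval should then cover the transfer shown in the Memory row:

#include <cuda_runtime.h>
#include <vector>

int main()
{
    const size_t count = 1024;               // illustrative element count
    const size_t bytes = count * sizeof(float);

    std::vector<float> h_src(count, 1.0f);   // pageable storage
    float* d_dst;
    cudaMalloc(&d_dst, bytes);

    // Page-lock the vector's existing storage so the copy uses the DMA path.
    cudaHostRegister(h_src.data(), bytes, cudaHostRegisterDefault);

    // With a pinned source, cudaMemcpy blocks the host until the transfer
    // has completed, so the API call and the HW copy should line up.
    cudaMemcpy(d_dst, h_src.data(), bytes, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_src.data());
    cudaFree(d_dst);
    return 0;
}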

The copy is only around 3 KiB. This is the minimal reproducible code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <vector>
#include <iostream>
#include <string>
#include <algorithm>

#define gpuErrchk() { gpuAssert(__FILE__, __LINE__); }
inline void gpuAssert(const char* file, int line, bool abort = true)
{
    cudaDeviceSynchronize();
    cudaError_t code = cudaGetLastError();
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void dotProduct_v1(float* out, float* in1, float* in2, const int len)
{
    __shared__ float cache[1024];
    int i = threadIdx.x;
    int cacheIdx = threadIdx.x;
    float tmp{};
    while (i < len)
    {
        tmp += in1[i] * in2[i];
        i += blockDim.x;
    }

    cache[cacheIdx] = tmp;
    __syncthreads();

    i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIdx < i)
            cache[cacheIdx] += cache[cacheIdx + i];
        __syncthreads();

        i /= 2;
    }
    if (cacheIdx == 0)
        out[0] = cache[0];
}

void dotProductExample()
{
    const int inputLen = 1024 - 200;
    std::vector<float> in1(inputLen), in2(inputLen);
    float innerProduct;

    std::fill(in1.begin(), in1.end(), 1.0);
    std::fill(in2.begin(), in2.end(), 1.0);

    float* d_in1, * d_in2, * d_out;
    cudaMalloc(&d_in1, inputLen * sizeof(float));
    cudaMalloc(&d_in2, inputLen * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    gpuErrchk();

    cudaMemcpy(d_in1, in1.data(), inputLen * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, in2.data(), inputLen * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float)); // cudaMemset takes an int byte value; 0 zero-fills the float

    dotProduct_v1<<<1, 1024>>>(d_out, d_in1, d_in2, inputLen);
    gpuErrchk();

    cudaMemcpy(&innerProduct, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "dotProduct result: " << innerProduct << std::endl;
}

int main()
{
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    dotProductExample(); 

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

That makes sense. I increased the input length to 2 MiB and see the following in the timeline (I had to increase it to at least 500 KiB to see some overlap between the API call and the actual execution on the GPU):
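For reference, the change was just the input size in the reproducer above (the exact value here is illustrative of the 2 MiB case):

// 512K floats ≈ 2 MiB per input buffer; the kernel's block-stride loop
// handles lengths larger than the block size.
const int inputLen = 512 * 1024;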

I’ll file a bug and request a more detailed explanation of cudaMemcpy behavior in the docs. Thank you.