Understanding the output of Nsight Systems (CUDA API row vs. rows in CUDA HW section)

Hello, everyone

I have a CUDA application that I trace with Nsight Systems (GPU: GeForce RTX 2050, OS: Windows 11, CUDA 12.4). I watched some tutorials about the tool, and from what I understand, the CUDA API row shows the time the host spends in each API call or kernel launch.
When we use a synchronous API like cudaMemcpy, we expect the call not to return until the corresponding work has finished on the GPU. However, what I see on the Nsight Systems timeline isn't consistent with this:

What I see here is that the host calls cudaMemcpy and this call takes some time; then, before the API even starts working on the GPU, the host returns and issues the next API call, which is another cudaMemcpy. Is my understanding correct? If so, how is it possible that a synchronous API call returns before its work has executed on the GPU?

Thank you.

Please read the CUDA Runtime API documentation on API synchronization behavior.

The API provides memcpy/memset functions in both synchronous and asynchronous forms, the latter having an “Async” suffix. This is a misnomer as each function may exhibit synchronous or asynchronous behavior depending on the arguments passed to the function.
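For illustration, here is a minimal sketch (buffer names and the 1 MiB size are just placeholders) contrasting a copy from pageable memory with an Async copy from pinned memory; the source memory and the function variant together determine how much the call blocks the host:

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t bytes = 1 << 20;   // 1 MiB, illustrative size
    float* d_buf;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory: ordinary malloc/new/std::vector storage.
    float* h_pageable = static_cast<float*>(std::malloc(bytes));

    // Page-locked (pinned) host memory: eligible for true asynchronous DMA.
    float* h_pinned;
    cudaMallocHost(&h_pinned, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pageable source: cudaMemcpy may return once the data has been staged
    // for transfer; the DMA to the device can still be in flight.
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned source with the Async variant: the call returns immediately;
    // synchronize before the host reuses or frees h_pinned.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    std::free(h_pageable);
    cudaFree(d_buf);
    return 0;
}

The case in your trace is the first one: a plain cudaMemcpy from pageable memory.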

I read the link you provided. However, I am doing an H2D cudaMemcpy (from pageable host memory), and what I see in the timeline is asynchronous behavior! Also, quoting from the link:

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

Even if that were the case, the cudaMemcpy shown in the CUDA API row and the copy shown in the Memory row of the CUDA HW section should overlap at least partially, but they don't. So, unfortunately, I still do not understand.

The screenshot does not convey much information. Small copies (tens of KiB) can be done through a different mechanism than the asynchronous copy engine, and this route can be asynchronous. The best approach is to file a bug against the CUDA driver/documentation requesting an improvement to the documentation. Please include a minimal reproducible example.
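If you want to check whether the pageable/staging path is what you are seeing, one experiment (a sketch with illustrative names, not something I have profiled on your setup) is to page-lock the host buffer with cudaHostRegister; for a pinned source, cudaMemcpy is synchronous with respect to the host, so the API-row interval should then cover the transfer shown in the Memory row:

#include <cuda_runtime.h>
#include <vector>

int main()
{
    const size_t count = 1024;               // illustrative element count
    const size_t bytes = count * sizeof(float);

    std::vector<float> h_src(count, 1.0f);   // pageable storage
    float* d_dst;
    cudaMalloc(&d_dst, bytes);

    // Page-lock the vector's existing storage so the copy uses the DMA path.
    cudaHostRegister(h_src.data(), bytes, cudaHostRegisterDefault);

    // With a pinned source, cudaMemcpy blocks the host until the transfer
    // has completed, so the API call and the HW copy should line up.
    cudaMemcpy(d_dst, h_src.data(), bytes, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_src.data());
    cudaFree(d_dst);
    return 0;
}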

The copy is only around 3 KiB. This is the minimal reproducible code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <vector>
#include <iostream>
#include <string>
#include <algorithm>

#define gpuErrchk() { gpuAssert(__FILE__, __LINE__); }
inline void gpuAssert(const char* file, int line, bool abort = true)
{
    cudaDeviceSynchronize();
    cudaError_t code = cudaGetLastError();
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void dotProduct_v1(float* out, float* in1, float* in2, const int len)
{
    __shared__ float cache[1024];
    int i = threadIdx.x;
    int cacheIdx = threadIdx.x;
    float tmp{};
    while (i < len)
    {
        tmp += in1[i] * in2[i];
        i += blockDim.x;
    }

    cache[cacheIdx] = tmp;
    __syncthreads();

    i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIdx < i)
            cache[cacheIdx] += cache[cacheIdx + i];
        __syncthreads();

        i /= 2;
    }
    if (cacheIdx == 0)
        out[0] = cache[0];
}

void dotProductExample()
{
    const int inputLen = 1024 - 200;
    std::vector<float> in1(inputLen), in2(inputLen);
    float innerProduct;

    std::fill(in1.begin(), in1.end(), 1.0);
    std::fill(in2.begin(), in2.end(), 1.0);

    float* d_in1, * d_in2, * d_out;
    cudaMalloc(&d_in1, inputLen * sizeof(float));
    cudaMalloc(&d_in2, inputLen * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    gpuErrchk();

    cudaMemcpy(d_in1, in1.data(), inputLen * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, in2.data(), inputLen * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float)); // cudaMemset takes an int byte value; 0 zero-fills the float

    dotProduct_v1<<<1, 1024>>>(d_out, d_in1, d_in2, inputLen);
    gpuErrchk();

    cudaMemcpy(&innerProduct, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "dotProduct result: " << innerProduct << std::endl;
}

int main()
{
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    dotProductExample(); 

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

That makes sense. I increased the input length to 2 MiB and see the following in the timeline (I had to increase it to at least 500 KiB to see some overlap between the API call and the actual execution on the GPU):
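For reference, the change was just the input size in the reproducer above (the exact value here is illustrative of the 2 MiB case):

// 512K floats ≈ 2 MiB per input buffer; the kernel's block-stride loop
// handles lengths larger than the block size.
const int inputLen = 512 * 1024;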

I’ll file a bug and request a more detailed explanation of cudaMemcpy behavior in the docs. Thank you.