Clarification on cudaMemcpy synchronization behavior with pageable memory and non-blocking streams

Hello. I have a question about the synchronization behavior that the specification guarantees for cudaMemcpy() data transfers.

I understand that cudaMemcpy() uses the default stream and returns when the copy has finished, but I found the following note in the CUDA Runtime API documentation, section 2, "API synchronization behavior":

Memcpy
Synchronous

  1. For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

Consider the following sequence:

  1. The CPU prepares data on pageable host memory and transfers the data to GPU device memory with cudaMemcpy(…, cudaMemcpyHostToDevice).
  2. The CPU launches a CUDA kernel ‘func’ on stream A.
  3. CUDA kernel ‘func’ on stream A executes a task with the data on GPU device memory.

I previously assumed this sequence was safe because cudaMemcpy() is a synchronous API. However, the documentation above has made me concerned. If cudaMemcpy() returns before the actual DMA to the final GPU destination is complete:

  1. Is it possible for kernel func to start executing before the data has fully arrived in device memory?

  2. In a typical case, does the “implicit synchronization with the legacy default stream” prevent the kernel from starting until the DMA is complete?

  3. If Stream A was created with the cudaStreamNonBlocking flag (to avoid implicit synchronization), is this sequence still safe?

  4. If it is not safe, what is the most “cost-effective” way to ensure that the kernel waits for the DMA to complete?

  5. Should I replace cudaMemcpy(…, cudaMemcpyHostToDevice) with cudaMemcpyAsync(…, cudaMemcpyHostToDevice, stream_a)?

  6. Should I allocate a pinned host memory area and have the CPU copy the data from the pageable source into it before calling cudaMemcpyAsync(…, cudaMemcpyHostToDevice, stream_a), so that the copy can run asynchronously?
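To make option 6 concrete, this is the kind of code I have in mind (a sketch only; stream_a, d_data, h_pageable, func, and the launch configuration are placeholder names, and error checking is omitted):

```cpp
// Sketch of option 6: stage the pageable data into a pinned buffer,
// then enqueue a truly asynchronous copy and the kernel on the same stream.
int *h_pinned = nullptr;
cudaMallocHost(&h_pinned, size);                 // page-locked (pinned) allocation

memcpy(h_pinned, h_pageable, size);              // CPU copy into the staging buffer
cudaMemcpyAsync(d_data, h_pinned, size, cudaMemcpyHostToDevice, stream_a);
func<<<blocks, threads, 0, stream_a>>>(d_data);  // ordered after the copy in stream_a

cudaStreamSynchronize(stream_a);                 // before reusing or freeing h_pinned
cudaFreeHost(h_pinned);
```

Because the copy and the kernel are enqueued on the same stream, stream ordering alone guarantees the kernel does not start before the DMA completes, independent of the cudaStreamNonBlocking flag.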

Background:

I recently read the following discussion and learned that casual use of the cudaStreamNonBlocking flag may cause unexpected trouble:
cudaMemcpyAsync, unexpected behaviour while using cudaStreamNonBlocking?

I was planning to set the cudaStreamNonBlocking flag on all streams to avoid the performance problems caused by "implicit synchronization with the legacy default stream", but I want to make sure I fully understand the synchronization guarantees first.

My current environment is CUDA 12.4 and I’ll move to CUDA 13.1.

Thank you for your help.

From my understanding, the copy will still be associated with the default stream even after the API call returns. So operations in streams not created with cudaStreamNonBlocking will not begin until all of the data has arrived in GPU memory.

Personally, if I wanted to use multiple streams, I would avoid the default stream altogether by using the *Async APIs and specifying a non-default stream for all kernel launches.
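For example, a minimal pattern would look like this (a sketch; the identifiers are placeholders, and the host buffers are assumed to be pinned so the copies are truly asynchronous):

```cpp
// Every operation names an explicit, non-default stream; the default stream is never used.
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

cudaMemcpyAsync(d_in, h_pinned_in, size, cudaMemcpyHostToDevice, s);
myKernel<<<grid, block, 0, s>>>(d_in, d_out);
cudaMemcpyAsync(h_pinned_out, d_out, size, cudaMemcpyDeviceToHost, s);

cudaStreamSynchronize(s);   // the host waits only on this stream
cudaStreamDestroy(s);
```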


Dear striker159,

Thank you for your comments.

I agree with you that the cuda*Async APIs should be used to avoid the default stream, and that all kernels should be launched on non-default streams explicitly.
I believe this is the best practice for handling multiple streams.

I’m trying to improve the performance of a considerable amount of CUDA and TensorRT code written by other engineers.
I found that some of the code was suffering from "implicit synchronization with the legacy default stream", and I was planning to solve the problem by setting the cudaStreamNonBlocking flag on all streams (an easy and lazy solution…).
If setting the cudaStreamNonBlocking flag has any drawbacks, I need to solve the problem without the flag.

Sincerely,

Google Gemini answered my question as follows (as I understand it):

  • The adjectives "Synchronous" and "Asynchronous" are from the viewpoint of the host CPU. They say nothing from the viewpoint of the GPU and CUDA streams.
  • Calling cudaMemcpy() inserts a barrier into every stream on the GPU, which says "do not execute any request beyond this barrier until this data transfer completes."
    • The barriers are set even if the streams were created with the cudaStreamNonBlocking flag.
    • This behavior is independent of the so-called "implicit synchronization with the legacy default stream".
  • Calling cudaMemcpyAsync(…, streamA) inserts the barrier only into streamA.

It was not able to give me any reference to an official NVIDIA description of this behavior, though I requested a source. I’m still skeptical about AI.
I would greatly appreciate it if some kind soul could point me to a reference.

I now believe the answer is correct, because the following test code succeeds every time.
I ran it on a Jetson Orin (CUDA 12.6), compiling it with nvcc both with and without the `--default-stream per-thread` option.

#include <iostream>
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>
#include <cuda_runtime.h>

/**
 * Kernel to verify if the data in device memory matches the expected value.
 * If cudaMemcpy did not finish the actual DMA transfer before the kernel started,
 * some threads will detect incorrect values (either 0 from memset or old values).
 */
__global__ void checkDataKernel(const int* data, int expected, int* errorCount, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (data[idx] != expected) {
            atomicAdd(errorCount, 1);
        }
    }
}

int main() {
    // Large size (approx. 256MB) to increase transfer time and expose potential race conditions
    const int N = 1024 * 1024 * 64;
    const size_t size = N * sizeof(int);
    const int iterations = 100;

    // 1. Prepare Pageable Host Memory
    std::vector<int> h_src(N);

    int *d_data, *d_errorCount;
    cudaMalloc(&d_data, size);
    cudaMalloc(&d_errorCount, sizeof(int));

    // 2. Create a Non-blocking Stream
    // The 'cudaStreamNonBlocking' flag means this stream does not synchronize with the legacy default stream.
    cudaStream_t streamA;
    cudaStreamCreateWithFlags(&streamA, cudaStreamNonBlocking);

    // Prepare random number generator
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<int> dis(1, 1000000);

    int totalFailures = 0;

    std::cout << "Starting 100 iterations of synchronization test with cudaStreamNonBlocking..." << std::endl;

    for (int i = 0; i < iterations; ++i) {
        // Generate a new expected value for each iteration to ensure we are not reading stale data
        int expectedValue = dis(gen);
        std::fill(h_src.begin(), h_src.end(), expectedValue);

        // Reset device memory to zero to "taint" it before each transfer
        cudaMemset(d_data, 0, size);
        cudaMemset(d_errorCount, 0, sizeof(int));

        // Ensure the reset is complete before starting the test
        cudaDeviceSynchronize();

        // --- Start Verification ---

        // Synchronous cudaMemcpy (Host to Device) using the default stream.
        // For pageable memory, it might return as soon as the data is in the staging buffer.
        cudaMemcpy(d_data, h_src.data(), size, cudaMemcpyHostToDevice);

        // Launch the kernel immediately on the Non-blocking stream.
        // We are testing if this kernel waits for the cudaMemcpy's DMA to complete.
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        checkDataKernel<<<blocksPerGrid, threadsPerBlock, 0, streamA>>>(d_data, expectedValue, d_errorCount, N);

        // Copy error count back to host using the same stream
        int h_errorCount = 0;
        cudaMemcpyAsync(&h_errorCount, d_errorCount, sizeof(int), cudaMemcpyDeviceToHost, streamA);
        // Wait for the kernel and error check to finish
        cudaStreamSynchronize(streamA);

        // --- End Verification ---

        if (h_errorCount > 0) {
            std::cout << "Iteration " << i << ": FAILURE (Errors detected: " << h_errorCount << ")" << std::endl;
            totalFailures++;
        } else if (i % 10 == 0) {
            std::cout << "Iteration " << i << ": Passed..." << std::endl;
        }
    }

    if (totalFailures == 0) {
        std::cout << "\nRESULT: All 100 iterations PASSED." << std::endl;
        std::cout << "cudaMemcpy properly synchronized with the non-blocking stream on this system." << std::endl;
    } else {
        std::cout << "\nRESULT: FAILED in " << totalFailures << " out of " << iterations << " iterations." << std::endl;
        std::cout << "A synchronization gap between cudaMemcpy and cudaStreamNonBlocking was observed." << std::endl;
    }

    // Cleanup
    cudaFree(d_data);
    cudaFree(d_errorCount);
    cudaStreamDestroy(streamA);

    return 0;
}

This reference may clarify the situation.

Dear rs277,

Thank you very much for the reference, but it is exactly the same description that raised my question in the first place.
I read it again with my updated knowledge, but I still have to say the description is not clear enough.

The behavior of cudaMemcpy() with pageable host memory is reasonable (no surprises) and valid.
I just want to make sure that the CUDA specification guarantees this behavior.
I suspect that the specification guarantees it but that it is not well documented.

Sincerely,

Apologies, I didn’t check your initial link.

The "cudaMemcpy() with pageable host memory" case I would not regard as guaranteed, even though your tests have succeeded; hence the disclaimer.

cudaMemcpy() from pinned memory would appear to guarantee the kernel will not launch until the data is complete on the device.
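For example, the pinned variant of the host buffer in the test above would look like this (a sketch reusing the variable names from that test; with page-locked memory, cudaMemcpy returns only after the copy to the device has completed):

```cpp
// Pinned-memory variant of the host source buffer (sketch).
int *h_src = nullptr;
cudaMallocHost(&h_src, size);                    // page-locked allocation
std::fill(h_src, h_src + N, expectedValue);

cudaMemcpy(d_data, h_src, size, cudaMemcpyHostToDevice);  // blocks until the copy is done

// ... launch checkDataKernel as before ...
cudaFreeHost(h_src);
```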


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.