Why doesn't overlapping data transfers and kernel execution work here?

Hi there.

I’m relatively new to CUDA and have been working my way through a bunch of CUDA tutorials, documentation and forum posts over the last few weeks. My C++ PoC code, which launches a kernel that calls a couple of device-only functions doing some geometrical operations, works fine.

Right now, I’m trying to optimize data transfers and kernel execution to further improve the programme’s performance. Overlapping data transfers and kernel executions seems like a good fit, as we’re talking about hundreds of millions of geometrical objects that can be processed independently. When I profile the programme in Nsight Systems 2024.7.1, though, transfers and kernel executions appear to happen strictly sequentially. I’d appreciate any suggestions on what I’m doing wrong here.

GPU specs:

NVIDIA RTX A2000 Laptop GPU
memory: 4095 MiB
deviceOverlap: 1
concurrentKernels: 1
asyncEngineCount: 5
computeCapability: 8.6
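
(For reference, these values come from cudaGetDeviceProperties; a minimal sketch of how they can be queried, not part of my programme:)

#include "cuda_runtime.h"
#include <cstdio>

int main() {
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, 0);

    printf("%s\n", props.name);
    printf("memory: %zu MiB\n", props.totalGlobalMem / (1024 * 1024));
    printf("deviceOverlap: %d\n", props.deviceOverlap);
    printf("concurrentKernels: %d\n", props.concurrentKernels);
    printf("asyncEngineCount: %d\n", props.asyncEngineCount);
    printf("computeCapability: %d.%d\n", props.major, props.minor);
    return 0;
}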

Relevant bits of PoC code:

long calcGPU(PointXY* ptRequest, long cPtRequest, PointXY* ptPoly, long cPtPoly, bool* bPtRequestIn, bool bOverlapTransactions) {
    long begin = GetTickCount();

    PointXY* ptRequestDev;
    PointXY* ptPolyDev;
    bool* bPtRequestInDev;

    // Set device and query properties
    cudaSetDevice(0);
    cudaDeviceProp* devProps = new cudaDeviceProp;
    cudaGetDeviceProperties(devProps, 0);
    
    cudaMalloc((void**)&ptRequestDev, cPtRequest * sizeof(PointXY));
    cudaMalloc((void**)&ptPolyDev, cPtPoly * sizeof(PointXY));
    cudaMalloc((void**)&bPtRequestInDev, cPtRequest * sizeof(bool));

    cudaMemcpy(ptPolyDev, ptPoly, cPtPoly * sizeof(PointXY), cudaMemcpyHostToDevice);

    int blockSize = 512;
    int numBlocks = (cPtRequest + blockSize - 1) / blockSize;

    // *** Transfer data asynchronously in smaller chunks and compute in parallel ***
    if (bOverlapTransactions && devProps->deviceOverlap) {
        cudaError_t result;

        int iNumGroups = devProps->asyncEngineCount;
        int iChunkSize = (cPtRequest + iNumGroups - 1) / iNumGroups;    // round up so that all data gets processed

        cudaStream_t* streams = new cudaStream_t[iNumGroups];

        for (int iIndex = 0; iIndex < iNumGroups; iIndex++) {
            cudaStreamCreate(&streams[iIndex]);

            int offset = iIndex * iChunkSize;
            int iCurrChunkSize = iChunkSize;

            if (offset + iCurrChunkSize > cPtRequest) {
                iCurrChunkSize = cPtRequest - offset;
            }

            // ptRequest and bPtRequestIn point to pinned host memory allocated earlier via
            // cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY)) and
            // cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool))
            cudaMemcpyAsync(&ptRequestDev[offset], &ptRequest[offset], iCurrChunkSize * sizeof(PointXY), cudaMemcpyHostToDevice, streams[iIndex]);
            ptInPolyKernel << <(iCurrChunkSize + blockSize - 1) / blockSize, blockSize, 0, streams[iIndex] >> > (ptRequestDev, cPtRequest, ptPolyDev, cPtPoly, bPtRequestInDev, offset);
            cudaMemcpyAsync(&bPtRequestIn[offset], &bPtRequestInDev[offset], iCurrChunkSize * sizeof(bool), cudaMemcpyDeviceToHost, streams[iIndex]);
        }

        cudaDeviceSynchronize();

        // Clean up streams
        for (int iIndex = 0; iIndex < iNumGroups; iIndex++) {
            cudaStreamDestroy(streams[iIndex]);
        }

        delete[] streams;
    }

    cudaFree(ptRequestDev);
    cudaFree(ptPolyDev);
    cudaFree(bPtRequestInDev);
    delete devProps;    // don't leak the device properties struct

    return GetTickCount() - begin;
}

Nsight Systems screenshot:

…to be more precise: I just ran the code from the Nvidia example on overlapping, and even there, overlapping does not work.

Console output of the sample code

Device : NVIDIA RTX A2000 Laptop GPU
Time for sequential transfer and execute (ms): 6.953056
max error: 1.192093e-07
Time for asynchronous V1 transfer and execute (ms): 8.154112
max error: 1.192093e-07
Time for asynchronous V2 transfer and execute (ms): 6.227968
max error: 1.192093e-07
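
(For context, the two asynchronous variants in that sample differ only in the order in which work is issued; roughly, a sketch of the sample's structure, with names as used there:)

// Asynchronous version 1: depth-first issue order - all work for one chunk
// is issued to its stream before moving on to the next chunk.
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}

// Asynchronous version 2: breadth-first issue order - all H2D copies first,
// then all kernels, then all D2H copies.
for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(&d_a[i * streamSize], &a[i * streamSize], streamBytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < nStreams; ++i)
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, i * streamSize);
for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(&a[i * streamSize], &d_a[i * streamSize], streamBytes, cudaMemcpyDeviceToHost, stream[i]);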

Nsight profiling for asynchronous version 1

Nsight profiling for asynchronous version 2

So, basically, even the sample code executes entirely sequentially. Does my GPU not support asynchronous copying and execution, even though the device properties say so? Or what am I doing wrong here?

Hi hstir,
I would move calls like cudaStreamCreate outside of/before the actual loop.
I’m not sure whether stream creation works asynchronously.

Asynchronous copies only work with pinned memory. Have you made sure that your host memory is pinned?
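
For illustration, a minimal sketch of the difference (PointXY here just stands in for your payload type):

#include "cuda_runtime.h"
#include <cstdlib>

struct PointXY { float x, y; };

int main() {
    const size_t n = 1 << 20;

    // Pageable host memory: cudaMemcpyAsync degrades to a staged copy that
    // cannot overlap with kernel execution.
    PointXY* pageable = (PointXY*)malloc(n * sizeof(PointXY));

    // Pinned (page-locked) host memory: required for truly asynchronous copies.
    PointXY* pinned;
    cudaMallocHost((void**)&pinned, n * sizeof(PointXY));

    // ... issue cudaMemcpyAsync calls and kernels here ...

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}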

Make sure that CUDA_LAUNCH_BLOCKING is not set (to 1) in the environment.

Thank you for your reply.
I just moved the cudaStreamCreate calls into a separate for-loop in my playground project. This didn’t do the trick, though, as far as I can see.

The memory in use is allocated earlier using cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY)). Sorry for not making that clearer in the OP.

How would I find out whether CUDA_LAUNCH_BLOCKING is set? I’m working either from Visual Studio or launching the compiled .exe via Nvidia Nsight Systems. I couldn’t find any mention of CUDA_LAUNCH_BLOCKING either in the Visual Studio settings or in my system’s environment variables. Would it be worth explicitly setting it to 0? And where would I have to do that?
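
One way I could check from within the programme itself would be something like this sketch:

#include <cstdio>
#include <cstdlib>

int main() {
    // getenv returns nullptr if the variable is not set in the process environment.
    const char* v = std::getenv("CUDA_LAUNCH_BLOCKING");
    printf("CUDA_LAUNCH_BLOCKING = %s\n", v ? v : "(not set)");
    return 0;
}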

Also, please see the fully working CUDAPlayground code I currently use for testing. Below, you can also find the latest Nsight profiling results after I moved the cudaStreamCreate calls.

CUDAPlayground code
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>

class PointXY {
public:
    float x;
    float y;

    PointXY(float x, float y) {
        this->x = x;
        this->y = y;
    }

    PointXY() {
        this->x = 0;
        this->y = 0;
    }
};

const float multip = 5;

inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    }
#endif
    return result;
}

__global__ void ptInPolyKernel(PointXY* ptRequest, long cPtRequest, int offset, int chunksize, float multiplier) {

    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads beyond this chunk and beyond the overall array.
    if (i < offset + chunksize && i < cPtRequest) {
        ptRequest[i].x = ptRequest[i].x * multiplier;
        ptRequest[i].y = ptRequest[i].y * multiplier;
    }
}

// Round a value up to the next multiple of the second argument.
long roundUpToMultiple(double value, long multiple) {
    return (long) ceil(value / multiple) * multiple;
}

int main()
{
    int cPtRequest;

    std::cout << "Please enter the number of PointXY objects you want to transfer to the GPU: ";
    std::cin >> cPtRequest;

    PointXY* ptRequest;
    PointXY* ptRequestDev;

    // Set device and query properties
    checkCuda(cudaSetDevice(0));
    cudaDeviceProp* devProps = new cudaDeviceProp;
    checkCuda(cudaGetDeviceProperties(devProps, 0));

    checkCuda(cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY)));
    checkCuda(cudaMalloc((void**)&ptRequestDev, cPtRequest * sizeof(PointXY)));

    // Generate test data
    for (int i = 0; i < cPtRequest; i++) {
        ptRequest[i] = PointXY(i, i);
    }

    int blockSize = 512;
    int numBlocks = (cPtRequest + blockSize - 1) / blockSize;

    // *** Transfer data asynchronously in smaller chunks and compute in parallel ***
    cudaError_t result;

    //int iNumGroups = devProps->asyncEngineCount;
    int iNumGroups = 5;
    int iChunkSize = (cPtRequest + iNumGroups - 1) / iNumGroups;    // round up so that all data gets processed
    iChunkSize = (int) roundUpToMultiple(iChunkSize, blockSize);    // make iChunkSize a multiple of blockSize to avoid unnecessary work
    std::cout << "iChunkSize = " << iChunkSize << "\n";

    // Create the required CUDA streams
    cudaStream_t* streams = new cudaStream_t[iNumGroups];
    for (int i = 0; i < iNumGroups; i++) {
        cudaStreamCreate(&streams[i]);
    }

    for (int iIndex = 0; iIndex < iNumGroups; iIndex++) {
        int offset = iIndex * iChunkSize;
        int iCurrChunkSize = iChunkSize;

        if (offset + iCurrChunkSize > cPtRequest) {
            iCurrChunkSize = cPtRequest - offset;
        }

        cudaMemcpyAsync(&ptRequestDev[offset], &ptRequest[offset], iCurrChunkSize * sizeof(PointXY), cudaMemcpyHostToDevice, streams[iIndex]);
        ptInPolyKernel << <(iCurrChunkSize + blockSize - 1) / blockSize, blockSize, 0, streams[iIndex] >> > (ptRequestDev, cPtRequest, offset, iCurrChunkSize, multip);
        cudaMemcpyAsync(&ptRequest[offset], &ptRequestDev[offset], iCurrChunkSize * sizeof(PointXY), cudaMemcpyDeviceToHost, streams[iIndex]);
    }

    cudaDeviceSynchronize();

    // Clean up streams
    for (int iIndex = 0; iIndex < iNumGroups; iIndex++) {
        checkCuda(cudaStreamDestroy(streams[iIndex]));
    }

    delete[] streams;

    // Check results
    long cFehler = 0;
    long iFehlerErster = -1;
    long iFehlerLetzter = -1;
    long iAbweichungKumul = 0;

    for (int i = 0; i < cPtRequest; i++) {
        float iSoll = i * multip;

        if (!(ptRequest[i].x == iSoll && ptRequest[i].y == iSoll)) {
            if (iFehlerErster < 0) {
                iFehlerErster = i;
            }

            if (i > iFehlerLetzter) {
                iFehlerLetzter = i;
            }

            cFehler++;
        }
    }

    if (cFehler > 0) {
        std::cout << cFehler << " errors between i = " << iFehlerErster << " and i = " << iFehlerLetzter << "\n";
    }
    
    cudaFreeHost(ptRequest);
    cudaFree(ptRequestDev);
    delete devProps;

    return 0;
}
Nsight Systems profiling of CUDAPlayground

It was a bit of a long shot; CUDA_LAUNCH_BLOCKING is set as an environment variable mostly for debugging purposes. But it would prevent any asynchronous execution, so it would fit the symptoms. With it set, the cudaDeviceSynchronize bar in the CUDA API row would probably not span the whole width. Same with the pinned memory.
You could quickly inspect the environment by calling something like system("cmd.exe /c set > environment.txt") in your program.
You could also activate CUDA_LAUNCH_BLOCKING to see what difference there is between your current results and actual blocking. Perhaps this gives another hint.

It is a laptop GPU, so it would be difficult to test with another graphics card, but could you run your program on another system?
If you insert host callbacks into your streams, you could test the timing without the CUDA SDK/Nsight Systems, e.g. as in the sketch below.
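
Something like this sketch, using cudaLaunchHostFunc (the label string here is just an example):

#include "cuda_runtime.h"
#include <cstdio>
#include <ctime>

// Host function that the CUDA runtime calls once all preceding work
// in the stream has finished.
void CUDART_CB logTimestamp(void* userData) {
    const char* label = (const char*)userData;
    printf("%s finished at host clock %ld\n", label, (long)clock());
}

// Usage inside the chunk loop, e.g. after each kernel launch:
//   cudaLaunchHostFunc(streams[iIndex], logTimestamp, (void*)"chunk kernel");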

I am quite at the end of ideas. Perhaps somebody else could chime in?

Yeah, I just thought of setting CUDA_LAUNCH_BLOCKING within Nsight Systems.
And - what the heck?!? When I explicitly set CUDA_LAUNCH_BLOCKING to 1 in Nsight Systems like so:

CUDA_LAUNCH_BLOCKING set

…there’s actually overlapping taking place?!?

Nsight profiling with CUDA_LAUNCH_BLOCKING set

I mean…there’s still no overlapping happening between kernel calls and memcpy calls, but at least it’s a start. And I totally don’t get why.

[EDIT:]
Oh, I forgot about your questions.

Testing with another GPU within this computer won’t be possible for the reason you mentioned. Testing with another computer does happen every now and then. So far, though, it didn’t happen using Nsight Systems, so right now, I can’t say anything about the overlapping in that scenario. Will do that as soon as possible.

Really strange!

Would you also try it with CUDA_LAUNCH_BLOCKING=0?

I think there is a small overlap between the kernel and the next H2D copy. Perhaps make the kernel slightly longer/slower for testing.

It seems the program is running on Windows, so this might just be a WDDM artifact.

After a couple more tests:

1.) On my colleague’s work computer (also an Nvidia RTX A2000, but the desktop version and apparently from a newer generation), all calls run sequentially as well. CUDA_LAUNCH_BLOCKING doesn’t have any effect here, regardless of the value set. He will also profile the programme on his private hardware over the weekend.

2.) Running the original version of the programme (with more work to do within the kernels), I get the following results:

CUDA_LAUNCH_BLOCKING = 0

CUDA_LAUNCH_BLOCKING = 1

So… With CUDA_LAUNCH_BLOCKING = 0, there is no overlapping whatsoever. With CUDA_LAUNCH_BLOCKING = 1, there is overlapping between memcpy calls, but still no overlapping between kernels or memcpy with kernels.

@striker159
Yes, everything is indeed running under Windows. What are you suggesting? Would that mean that under Windows I couldn’t use the full potential of a CUDA GPU?

[EDIT:]
Looking more closely at the recent profiling results with CUDA_LAUNCH_BLOCKING = 1, it kind of makes sense that there is no overlap with kernel executions here, as each kernel can, of course, only be launched once the data it needs has been copied. What I would expect from my code (see the for-loop in the relevant PoC code in my OP) is this: as soon as the first chunk of data has been copied from host to device in the first CUDA stream, the copy of the second chunk should start in the second CUDA stream, thus overlapping with the kernel execution in the first stream. This assumes that it is not possible to have two memcpy calls working in the same direction (host to device, or vice versa) running simultaneously. (I read this somewhere, although I’m still hoping for official confirmation.)

Any ideas why, instead, CUDA waits for the kernel launch to finish before it starts the next memcpy call for Host to Device?
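
(For reference, if issue order were the problem, the breadth-first variant from the Nvidia example would be the thing to try; a sketch adapted to my playground loop, with last-chunk handling omitted for brevity:)

// All H2D copies first, then all kernels, then all D2H copies.
for (int i = 0; i < iNumGroups; i++) {
    int offset = i * iChunkSize;
    cudaMemcpyAsync(&ptRequestDev[offset], &ptRequest[offset],
                    iChunkSize * sizeof(PointXY), cudaMemcpyHostToDevice, streams[i]);
}
for (int i = 0; i < iNumGroups; i++) {
    int offset = i * iChunkSize;
    ptInPolyKernel<<<iChunkSize / blockSize, blockSize, 0, streams[i]>>>(
        ptRequestDev, cPtRequest, offset, iChunkSize, multip);
}
for (int i = 0; i < iNumGroups; i++) {
    int offset = i * iChunkSize;
    cudaMemcpyAsync(&ptRequest[offset], &ptRequestDev[offset],
                    iChunkSize * sizeof(PointXY), cudaMemcpyDeviceToHost, streams[i]);
}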

Try changing the GPU HW scheduling setting in Windows.

I just tried this, thanks for the hint. This way, though, I’m back to fully sequential execution. With GPU HW scheduling enabled, my PoC programme reports asyncEngineCount: 1 on startup; earlier, it always stated asyncEngineCount: 5. I have no clue what this behaviour means…

[EDIT:]
I just confirmed: switching GPU HW scheduling off again gives me back my asyncEngineCount: 5.

We got some new information while testing the programme mentioned in the OP on two more machines:

Tests on GeForce MX550

The GPU’s specs in this laptop are:

Conclusion:
So, this means that on the GeForce MX550, our test code works exactly as hoped, overlapping memcpy calls towards the device with memcpy calls from the device as well as with kernel launches. Here, it doesn’t make any difference whether GPU HW scheduling is switched on or off.

Also, for the profiling posted above, CUDA_LAUNCH_BLOCKING was either set to 0 or not explicitly set. With CUDA_LAUNCH_BLOCKING = 1, memcpy calls to and from the device still got overlapped, but kernels were launched blocking. This behaviour would fit the Nvidia documentation of CUDA_LAUNCH_BLOCKING: “Disables (when set to 1) or enables (when set to 0) asynchronous kernel launches.” In other words: following the documentation, CUDA_LAUNCH_BLOCKING should not have any impact on the overlapping of memcpy operations, but only on the execution of GPU kernels.

Tests on RTX A2000 Laptop

With CUDA_LAUNCH_BLOCKING = 0:

With CUDA_LAUNCH_BLOCKING = 1:

GPU specs:

Wikipedia says we’re talking about this GPU: RTX A2000 Mobile

Conclusion:
On this GPU, launching the programme with CUDA_LAUNCH_BLOCKING = 1 results in overlapping memcpy calls, but non-overlapping kernel launches, which is the expected behaviour following the Nvidia documentation.
However, setting CUDA_LAUNCH_BLOCKING = 0 or not explicitly setting it results in a completely sequential execution of the programme with a completely different execution order.

As stated earlier, activating the Windows GPU HW scheduling here also results in a completely sequential execution plus makes the GPU say it only has 1 asyncEngine available.

Tests on RTX A2000 Desktop version

We had done this test earlier; the outcome is the same as with the RTX A2000 Laptop GPU. Interestingly, the effect of toggling Windows GPU HW scheduling is also the same here:

GPU HW scheduling = 0:
GPU Specs TM Desktop without GPU HW scheduling

GPU HW scheduling = 1:
GPU Specs TM Desktop with GPU HW scheduling

Wikipedia says we’re talking about this GPU: RTX A2000

Note how the value of the property asyncEngineCount changes. Exactly the same happens on my RTX A2000 Laptop GPU.

So, in conclusion, these tests show that my test code itself works with respect to overlapping transfers and kernel executions. The question remains why it doesn’t work on the two newer GPUs we tested. So far, we haven’t had the opportunity to test the programme on an even newer GPU from another generation, or maybe on a consumer GPU instead of one from the “professional” series.

Also, is anybody aware of any breaking changes between Nvidia Compute Capability 7.5 (where overlapping executions worked) and Nvidia Compute Capability 8.6 (where they didn’t)? Is there anything I’d have to handle differently?

Thanks everybody for staying interested!

Another test could be to use Linux - there are distributions that can be run from a USB stick without needing installation.

Or to install two graphics cards in a desktop PC and use TCC on the non-display one.

Good point. We will try that.

Meanwhile, my colleague tested the PoC code on his private Windows computer, using an RTX 4080 (16 GB). Compute Capability here is stated to be 8.9; asyncEngineCount is 5.

Behaviour is basically the same as on the RTX A2000s:

  • With Windows GPU HW scheduling enabled, nothing overlaps. Furthermore, asyncEngineCount is again stated to be 1 instead of 5.
  • With Windows GPU HW scheduling disabled and CUDA_LAUNCH_BLOCKING = 1, memcpy calls towards and from the GPU overlap.
  • With CUDA_LAUNCH_BLOCKING = 0, nothing overlaps.
  • Kernel executions never overlap.

So, right now the score (everything tested under Windows) is:

  • Running the code using a GeForce MX550 Mobile GPU (Compute Capability = 7.5), memcpy calls and kernel launches run in parallel.
  • Running the code using three other GPUs (Compute Capability >= 8.6), only memcpy calls run in parallel in certain configurations, while kernel launches always run sequentially.

This brings me back to my question: Have there been any changes in the meantime, either to the CUDA drivers or hardware-wise, that require a different implementation in order to achieve overlapping memcpy calls and kernel launches?

Another test could be to use Linux

I’m currently struggling with this. Apparently, cross-compiling for Linux from Windows is not supported, or at least not easily achievable. Installing the CUDA Toolkit within a VirtualBox VM fails because no CUDA-compatible GPU is available there. The next thing I’m going to try is installing everything I need from a live Ubuntu. I wonder if that’s going to work, though, despite my machine’s 32 GB of RAM. If you have any suggestions on how to do this, I’d appreciate them.

[EDIT:]
I finally managed to get the CUDA Toolkit installed under WSL2. There, I was able to compile my PoC programme after removing some Windows-specific code. Running the programme from WSL2 doesn’t work properly, though: the results indicate that the programme can’t access the CUDA-capable GPU.
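
A quick way to verify that is a device-count check, e.g.:

#include "cuda_runtime.h"
#include <cstdio>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices visible: %d\n", n);
    return 0;
}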

But that wouldn’t have helped me anyway, as running Nsight Systems under WSL is apparently not supported so far. So, the next step will be to get the needed driver installed in an Ubuntu live system and run the PoC code there, using Nsight Systems.


WSL2 is the Linux “user space” but still the Windows “kernel space.”

I suggest sticking Linux on a USB stick and booting from that. It won’t be super fast, but it will be the real thing.

I tried just booting Linux from a USB stick. The problem is that I can’t get the needed proprietary driver installed: at some point I’d always have to reboot the system, which would make me lose any changes made to a non-persistent system.

But I’m going to install a full Ubuntu onto a USB hard drive and try it from there.


So… I finally got everything set up in my freshly installed Ubuntu 24.04 and was able to profile my PoC code there. However, right until the end I could not convince Nsight Systems to display a console window for the profiled programme, which it did out of the box in the Windows version. Thus, I had to make a few minor changes to the code in order to take the user interaction out of the game.

However, these are the results...

So, under Ubuntu, my PoC code also runs with parallel memcpy calls and kernel executions. Now, what does that mean? There HAS to be a way to make this work under Windows, right? I mean, on another computer with another GPU, it DID work under Windows…

A strange thing I noticed is that under Ubuntu, my programme reports an asyncEngineCount of 2, while under Windows it supposedly uses 5 async engines. How’s that? I assumed this was hardware-dependent information rather than OS- or driver-dependent?
