From the 3 kernels (as seen so far with Nsight Systems), the bottleneck is clearly the H2D copies.
What would be interesting are the Memory section and the Warp State section.
You can open the more detailed views by clicking the small triangle next to the section titles.
The Warp State statistics will show whether the threads were all blocked waiting for data.
The Memory statistics will show how much memory was transferred.
The overview reports an L2 Cache Throughput of 24.2% and flags it as low (below 60%). But 24.2% L2 cache throughput is actually a lot for PCIe traffic.
So the throughput is much higher than expected, whereas the kernel duration is much longer than expected. Perhaps the detailed numbers give a hint as to the reason.
You do not have to understand all the numbers at once (neither do I). When you hover the mouse over a parameter, an explanation is shown. The diagrams in the opened/detailed sections are easier to grasp.
Okay. Well. I watched two videos on how to work with Nsight Compute. There seems to be a lot to gain from using this profiler but also a lot you have to understand first.
These are the main suggestions Nsight Compute mentions on the summary page when launched to perform a full profiling. I suppose they already offer valuable information, although I don’t fully understand what they’re saying.
Here’s the Warp State section you asked for. I suppose the highlighted suggestion is a valuable hint as well. As far as I understand it, I’m not supposed to use thread-local memory. Which brings me to the next screenshot…
Nsight Compute specifically mentions lines 97 and 98 in my source code. While I don’t understand what the problem with line 97 is, my guess about line 98 was that the profiler doesn’t want me to refer to the same array element twice here. Maybe I should store the address in a local variable and access that instead? The suggestion in the screenshot mentioned before, though, makes me think it doesn’t like that I’m creating a new PointXY object within the thread. However, the point of this was to make sure the contents of the host object get copied to device memory. I will try what happens when I use ptRequestDevice[i] = ptRequestHost[i]; instead, though.
Another thing that came to my mind: I do have two M.2 SSDs installed in my system. As far as I know, this may affect the PCIe bandwidth available for other devices such as GPUs. I suppose this alone shouldn’t make the copy kernel take that long, though.
Hi hstlr,
thank you for sharing the images.
That is the copy kernel for H2D transfers?
I do not think it is using local memory. That was just one possible explanation/hint for some values. Local and global memory use very similar hardware features.
First a medium issue
Why are 40 MB read and 78 MB written?
The 40 MB are from reading PCIe and the 78 MB from writing to global memory?
What size is the data format of the elements of PointXY?
In the Memory Workload Analysis it is mentioned that only 16 bytes of the 32 bytes are used per sector.
I would assume that each warp reads and writes data aligned to 32 bytes? Even when considering the offset?
In general
We cannot get full PCIe bandwidth: The pipes are full, the threads are blocking, and nearly nothing gets through.
I think you already used all threads and SMs?
We could switch to wider (vector) transfers (128 bits per access per thread instead of 32 bits) to reduce the number of transactions in the pipes, as recommended in the LG Throttle stalls hint, which you marked in red.
We could use asynchronous transfer commands.
We could even give up zero copy for H2D transfers.
If you want to continue that avenue, I would try to use asynchronous transfers with memcpy_async.
Either
or
With memcpy_async, the data goes into shared memory first; that is why the data is copied as a batch of several transfers. The compute part of the copy kernel would then just store the result in global memory.
There probably is a complete example in the CUDA samples.
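A minimal sketch (my own, untested, not from the CUDA samples) of how the H2D copy kernel could stage data through shared memory with cooperative_groups::memcpy_async; it assumes PointXY is two packed floats (8 bytes), and names like TILE are made up:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void cpyToDeviceAsync(const PointXY* __restrict__ ptRequestHost, // mapped host memory
                                 PointXY* __restrict__ ptRequestDevice,     // device memory
                                 long n) {
    constexpr int TILE = 1024;                       // elements staged per block per iteration
    __shared__ float2 tile[TILE];
    const float2* src = reinterpret_cast<const float2*>(ptRequestHost);
    float2* dst = reinterpret_cast<float2*>(ptRequestDevice);
    cg::thread_block block = cg::this_thread_block();

    for (long base = (long)blockIdx.x * TILE; base < n; base += (long)gridDim.x * TILE) {
        long count = (n - base < TILE) ? (n - base) : TILE;
        // one bulk asynchronous copy per block instead of one load per thread
        cg::memcpy_async(block, tile, src + base, sizeof(float2) * count);
        cg::wait(block);                             // block until the staged data has arrived
        for (long i = threadIdx.x; i < count; i += blockDim.x)
            dst[base + i] = tile[i];                 // write the tile to global memory
        block.sync();                                // the tile can be reused in the next iteration
    }
}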
…I mean, I can see why the profiler would think that I don’t have to copy the memory here as I’m already using mapped pinned memory. But then, this line is the whole point of having a copy kernel…
I have no idea…
sizeof(PointXY) says it’s 8 bytes which makes sense as each PointXY object contains two coordinates of type float.
I experimented with several grid configurations. I think in the profiling run shown in the screenshots I was using the following:
Grid configuration
int blockSize = 1024;
int numBlocks = 60;
const int blockSizeCpy = 512;
const int numBlocksCpy = 60;
int iChunkSize = 5000000;
…while using 3 streams. I suppose with 3 streams it would have been more sensible to set blockSizeCpy = 256 in order to make H2D copying, D2H copying and ptInPoly execution possible all at the same time, but I don’t think this should be the issue here, right?
As mentioned earlier, I’m exploring possibilities to employ the GPU for extensive geometrical calculations on behalf of my company. So, part of what I want to achieve is well-documented proof-of-concept code, ideally following the most important best practices while, on the other hand, offering reasonably good performance. That’s why I tried to dig into overlapping transfers in the first place. Obviously, this doesn’t work in all possible environments, which is a bummer. This brought me to implementing zero-copy kernels in order to enforce overlapping “manually”, which seemed to be a promising way. While digging in further is really interesting and I see there’s a whole world to explore there, you kind of lost me with the in-depth analysis of how the kernels work.
Right now, I feel like I’m opening parenthesis after parenthesis and can’t seem to find the closing ones. I don’t really know what the best way to go on from here would be.
For some reason, CUDA probably reads the x coordinates first and then the y coordinates, instead of combining them into one vector access.
Looking at the memory read by the whole warp, each of the two memory accesses has holes (for the other coordinate); that is why 50% of the bandwidth is lost.
Quick fixes for this: reinterpret_cast the data as float2, or handle the array as a float array with twice as many elements.
A reason could also be that CUDA cannot be sure that PointXY is aligned at 8-byte boundaries. Could you add a manual alignment? (e.g. class alignas(8) PointXY { … };)
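Untested sketch of what the reinterpret_cast quick fix could look like in the copy kernel (assuming PointXY is exactly two packed floats, i.e. 8 bytes without padding; the same pattern with float4 would give the 128-bit accesses mentioned above, copying two points per thread):

__global__ void cpyToDeviceVec(const PointXY* __restrict__ ptRequestHost,
                               PointXY* __restrict__ ptRequestDevice,
                               long iChunkSize, int offset) {
    const float2* src = reinterpret_cast<const float2*>(ptRequestHost);
    float2* dst = reinterpret_cast<float2*>(ptRequestDevice);
    long start = offset + (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;
    for (long i = start; i < offset + iChunkSize; i += stride)
        dst[i] = src[i];                             // one 8-byte load and one 8-byte store per element
}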
If you do those profiling runs, I would just use one single stream to focus on that specific kernel.
I (Munich based) could offer Cuda consulting/implementation work for your company, if needed.
I think zero-copy is working for D2H-Transfers fast enough.
So one option would be to use
cudaMemcpyAsync for H2D (perhaps even with write-combined memory, if it works better with cudaMemcpyAsync), even if the transfers do not overlap, and
zero-copy for D2H, perhaps even integrated into the kernel (instead of a separate copy kernel).
When writing, the threads are not blocked, as they do not have to wait for a result. So it would be simpler to integrate with computations.
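An untested host-side sketch of that first option (buffer names and the ptInPoly parameter list are only assumptions on my part): the H2D transfer becomes an explicit cudaMemcpyAsync into a device buffer, and the kernel writes its bool results directly into the mapped pinned host buffer, so no separate D2H copy is needed afterwards.

cudaStream_t stream;
checkCuda(cudaStreamCreate(&stream));

// H2D: explicit asynchronous copy from the pinned host buffer into device memory
checkCuda(cudaMemcpyAsync(ptRequestDevice, ptRequest,
                          iChunkSize * sizeof(PointXY),
                          cudaMemcpyHostToDevice, stream));

// compute: reads from device memory, stores the bool results directly into the
// mapped host buffer (zero-copy D2H); with UVA the host pointer can be passed directly
ptInPoly<<<numBlocks, blockSize, 0, stream>>>(ptRequestDevice, bPtRequestIn, iChunkSize);

checkCuda(cudaStreamSynchronize(stream));            // results already sit in host memory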
Another option is to try to get memcpy_async with zero-copy kernels to run. memcpy_async copies larger blocks of memory and could be faster and better for H2D.
…and now use getCoordinates() in the copy kernel in order to make CUDA load all the needed information at once.
copy kernel
__global__ void cpyToDeviceKernel(PointXY* __restrict__ ptRequestHost, PointXY* __restrict__ ptRequestDevice, long iChunkSize, int offset) {
    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        ptRequestDevice[i] = PointXY(ptRequestHost[i].getCoordinates());
    }
}
Now, I would like to check if that makes any difference. (Although, my first test using Nsight System suggests it actually does!)
I will try this in the next step, thank you!
Understood!
Thank you for your offer. If we do actually implement serious, production-aimed software using CUDA in the future, this may be something we will consider. Right now, I suppose, that would be a bit over-the-top.
So, you don’t think this is merely due to the fact that on D2H, only booleans - meaning: wayyy less data - get transferred?
Interesting suggestions, thank you! I will look at those once I have evaluated whether your quick fixes had any impact.
…using float2 member and reading it using getCoordinates() method.
So, apparently, when I use a float2 property, CUDA reads both components of it immediately and caches the not-yet-needed component somewhere. That would explain why there’s only a slight difference between using getX()/getY() and getCoordinates(), respectively. What do you think?
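Roughly, the class now has this shape (a simplified sketch, details of the real class omitted):

// simplified sketch of a PointXY with a single float2 member; the real class differs in details
class alignas(8) PointXY {
    float2 xy;                                       // both coordinates live in one 8-byte member
public:
    __host__ __device__ PointXY() : xy(make_float2(0.0f, 0.0f)) {}
    __host__ __device__ PointXY(float x, float y) : xy(make_float2(x, y)) {}
    __host__ __device__ explicit PointXY(float2 c) : xy(c) {}
    __host__ __device__ float2 getCoordinates() const { return xy; } // one 64-bit load
    __host__ __device__ float getX() const { return xy.x; }
    __host__ __device__ float getY() const { return xy.y; }
};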
[EDIT 2:]
Plus, now I get partly new hints on optimizations in Nsight Compute!
I suppose you did hit something important there :-)
[EDIT 3:]
Btw, I marked @njuffa’s reply as solution because it really answers the original question of the thread. I hope it’s still okay to keep this discussion going. If not, I can open a separate thread.
In the source code image you showed, it refers to the following lines:
You can see two LDG operations on the right-hand side, one with an offset of 4.
Also we got those reads, which were nearly twice as much memory as needed (78 MB instead of 40 MB).
And the 50% coalescing errors.
By using a member variable of type float2, you implicitly changed the alignment to 8 bytes. So it could be better, because of the alignment alone or because of getCoordinates().
But the improvement seems to be much more than 2x?
Are the new speed numbers from Nsight Systems with a single stream?
Better, but still much longer than expected from the amount of transferred memory and the speed of PCIe.
No problem, will be glad to help if needed, just give me a message.
Surely that is one reason, but perhaps also because, when writing, the CUDA threads do not wait for a result and are more efficient; they can continue the program at once instead of blocking as they do for reads.
The hint is that you are using 60 * 256 threads. That is half the maximum.
Which could be fine to run in parallel to compute, but not fine for getting the highest speed possible for the kernel.
Could you please compare with the duration of cudaMemcpyAsync for the current amount of data to be copied in both directions? (Gladly from a past measurement.)
The expectation is that the speed in the end only depends on the PCIe bandwidth.
For efficient background loading, the data transfer should be at least as fast as compute (if possible), otherwise they would still be the bottleneck, when run in the background.
The memcpy_async could lead to a further speed-up, as the threads do not have to wait for each element. On the other hand, I am not sure if memcpy_async was performance-optimized by Nvidia for this use case (zero-copy memory).
Currently you get 40 MB / 19 ms = 2.1 GB/s?
PCIe should give around 10x as much speed. Depending on with how many lanes your GPU is connected.
Right! Reviewing that screenshot, I notice there is also an STG operation with 4 bytes of offset. I suppose this is when I call the PointXY constructor in order to create a deep copy of the object. In the new version, I’m also calling the constructor with a float2 object, so that’s something else that got improved.
Yes and yes.
Here are two profiling runs for comparison.
Both runs used:
1 cudaStream for copying
20,000,000 PointXY objects to copy
a chunksize of also 20,000,000 objects
60 blocks with 1024 threads each for the copy kernel
the first run used two float properties and respective methods
the second run used one float2 property and respective methods
So, this refactoring alone made the copying almost 23 times faster, if my math is right here. In this example, about 152 MiB of data get copied within 74 ms. That’s still around 2 GiB/s of speed.
Here's a test using cudaMemcpyAsync
Here, also 20,000,000 elements got copied using 1 stream.
Thank you for the graphs and results.
The GPU seems to be connected either with PCIe 3.0 x16 or PCIe 4.0 x8, both with a theoretical peak rate of 15.75 GB/s. The bandwidth is bidirectional: both directions can sustain this rate concurrently.
So you could continue in one of the following ways: either
find existing examples of high-bandwidth zero-copy memory read sample programs or benchmarks and compare, or
give up zero-copy for H2D but keep it for D2H (either as a separate copy kernel or integrated with the computation for true zero-copy), accepting non-perfect overlap between transfers and computation, or
implement memcpy_async in the H2D copy kernel, which could get higher bandwidth with fewer threads, but copies to shared memory first and is a small bit more involved, though doable, or
depending on the duration of computation, accept the current copy speed (fast enough), and try to make the H2D copy kernel use fewer threads and work in parallel to the computation.
Maybe this now is due to me having two M.2 SSDs installed. Regretfully, I can’t find any further information on the concrete mainboard built into my laptop, so I can’t definitely confirm that. Anyway…
I will check those suggestions and see which works best for my project. Thank you.
I suppose with the last point you’re suggesting to get back to what we had in mind earlier, using this combined zero-copy approach in order to make copying and kernel execution work in parallel, right? That would require the copy kernels to work at a similar speed when using fewer threads, right? Meaning: copying would still have to take approximately the same duration as the calculations.
Probably it is just the way the GPU is connected. There are some Mini PCs, which limit their PCIe to x8.
But it was important to know to compare the copy kernel speed to what is possible.
Yes, you got 74.3 ms for copying 160 MB H2D with the copy kernel on a single stream with maximum threads. If your computation for that amount of data takes a similar time, longer, or only slightly less, the copy would not have to be faster; it can just happen in the background. If not, then even non-overlapping copies are faster.
If you want to go that way, you have the two options to
make two or three streams and use the copy kernel with fewer threads (so they can run concurrently), or
try to integrate the copy kernel into your computation kernel again, either
directly reading the data as needed (zero-copy fashion; see the sketch further below) or
use dedicated warps to copy to shared memory (or to global memory) and read the data from there with the computation threads, or
have warps sometimes do computation, sometimes do data loading
But probably 74.3ms is too long and one of the other three approaches in my previous post would be better.
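For the variant that reads the data directly as needed, the computation kernel could look roughly like this (untested sketch; ptInPolyTest stands in for your actual point-in-polygon routine, and the polygon parameters are placeholders):

__global__ void ptInPolyZeroCopy(const PointXY* __restrict__ ptRequestHost,  // mapped host memory
                                 bool* __restrict__ bPtRequestIn,            // mapped host memory
                                 const PointXY* __restrict__ polygon, int nPolygon,
                                 long iChunkSize, int offset) {
    long start = offset + (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;
    for (long i = start; i < offset + iChunkSize; i += stride) {
        float2 p = ptRequestHost[i].getCoordinates();          // H2D read over PCIe (zero-copy)
        bPtRequestIn[i] = ptInPolyTest(p, polygon, nPolygon);  // placeholder for the actual test
    }
}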
Okay. Thank you.
I will have another look at Nsight Compute at this point. It tells me something about warp stalls; resolving those supposedly gives more opportunity to optimize copying speed. Maybe I will be able to get more out of that.
I just tried using our custom copy kernels (formerly zero-copy) with the same kind of pinned memory as the original overlapping implementation, instead of the mapped memory we needed for zero-copy. Meaning:
Code snippets for pinned memory with "zero-copy" approach
switch (testProperties.gpuMemoryStrategy) {
case OVERLAP_TRANSACTIONS:
    checkCuda(cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY))); // memory for the PointXY objects
    checkCuda(cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool))); // memory for returning the results
    break;
case ZERO_COPY:
    if (devProps.canMapHostMemory) {
        //checkCuda(cudaHostAlloc(&ptRequest, cPtRequest * sizeof(PointXY), cudaHostAllocMapped /* | cudaHostAllocWriteCombined*/)); // memory for the PointXY objects
        //checkCuda(cudaHostAlloc(&bPtRequestIn, cPtRequest * sizeof(bool), cudaHostAllocMapped /* | cudaHostAllocWriteCombined */ )); // memory for returning the results
        checkCuda(cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY))); // memory for the PointXY objects
        checkCuda(cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool))); // memory for returning the results
    }
    else {
        return false;
    }
    break;
}
So, just using pinned memory makes this 2x as fast. Does this make any sense at all?
Seeing that I don’t actually use “zero-copied” data in this approach I wouldn’t really need mapped host memory, right?
Hm, okay. How is this still zero-copy? I thought zero-copy means that the GPU directly accesses the data stored in host RAM?
Nice. I’ll keep it like this then.
Yes, I know. I didn’t think this was relevant here as I just wanted to directly compare mapped host memory and pinned memory. For further tests concerning overall performance I will experiment with a few grid configurations for H2D, D2H and kernel execution.
The H2D copy kernel itself is seen by CUDA like a normal kernel. When it reads, it does so directly from host memory. (That it stores the values afterwards is incidental.)
It does not need an explicit copy like cudaMemcpy(Async) before the H2D copy kernel.
You can argue that it is not actual zero-copy when a copy kernel is used; then it is just a matter of naming. But the memory requirements should actually be the same for a compute kernel using zero-copy and for a copy kernel using zero-copy memory to read data from the host.
I see! This makes sense. Thank you for your explanation. In that case, I’m surprised that zero-copy seems to be possible without using mapped host memory and the cudaHostGetDevicePointer call. Anyway, I will dig into this a little bit more.
I’m currently going away from in-depth analyzing the copy kernel a bit. Instead, I am looking at efficiently parallelizing the copying and kernel executions.
I can’t seem to get CUDA to execute the first ptInPoly kernel in parallel with the second H2D device copying operation though.
const int iNumStreams = 2;
int blockSize = 1024;
int numBlocks = 20;
const int blockSizeCpy = 256;
const int numBlocksCpy = 20;
This is for testing purposes only. I am aware that this configuration is not ideal in respect to device occupancy.
In the screenshot, we can see that the later ptInPoly calls do in fact happen in parallel to H2D or D2H copying operations. However, the fact that the second H2D copying operation only starts after the ptInPoly execution has finished suggests to me that the device is fully occupied with ptInPoly. This shouldn’t be the case, though. Or am I overlooking something here?
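To make clear what I’m doing, the launch pattern is roughly the following (a simplified sketch, not the literal code; some names are approximations):

for (int chunk = 0; chunk < numChunks; ++chunk) {
    cudaStream_t s = streams[chunk % iNumStreams];
    int offset = chunk * iChunkSize;

    // H2D copy kernel: pull this chunk from the pinned host buffer into device memory
    cpyToDeviceKernel<<<numBlocksCpy, blockSizeCpy, 0, s>>>(ptRequest, ptRequestDevice, iChunkSize, offset);

    // compute on the chunk just copied (same stream, so it waits for the copy above)
    ptInPoly<<<numBlocks, blockSize, 0, s>>>(ptRequestDevice, bPtRequestInDevice, iChunkSize, offset);

    // D2H copy kernel: write the bool results back to the host buffer
    cpyToHostKernel<<<numBlocksCpy, blockSizeCpy, 0, s>>>(bPtRequestInDevice, bPtRequestIn, iChunkSize, offset);
}
for (int i = 0; i < iNumStreams; ++i)
    checkCuda(cudaStreamSynchronize(streams[i]));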
According to the device properties, my GPU should support the following:
Could you reduce blockSize to 896? (896 is also divisible by 128, but still larger than 1536/2) and reduce blockSizeCpy to 128?
As your numBlocks is 20, you could even go lower with blockSize
Just to make it easier for the kernels to be run in parallel.
Reasons that they are not parallel could be
rounding (OTOH 1024 and 256 are quite round numbers): perhaps blocks cannot be combined in arbitrary numbers, but only in multiples of some size.
Another reason could be small CUDA kernels running in the background by the operating system (but wouldn’t Nsight Systems also detect those?)
The aim should be to get any configuration, which runs the streams or the kernels in parallel. And afterwards increase the block sizes again.
Same result… Even with blockSize = 640, the second H2D copy operation only starts after the ptInPoly kernel has finished. Strangely, the overall runtime decreased by about 0.1 seconds for blockSize = 640; blockSizeCpy = 128 compared to blockSize = 896; blockSizeCpy = 128. Apparently, the ptInPoly kernels get executed somewhat quicker. Anyway, maybe this is another issue.
Even the following configuration doesn’t give me the best possible parallelism:
int blockSize = 640;
int numBlocks = 8;
const int blockSizeCpy = 128;
const int numBlocksCpy = 8;
I mean…this way the GPU should be under-occupied all the time, right?