Possibilities to further optimize PoC programme using custom copy kernels

This thread is somewhat of a continuation of the discussion started here. As the original question of the linked thread (“Why doesn’t overlapping work here?”) was answered a while ago, I think it’s reasonable to switch to a new thread.

So, for anybody new coming in, I’ll quickly sum up what this is about:
I’m exploring the workflow and benefits of CUDA for my company. To do that, I’m developing a proof-of-concept programme that determines the position of several million geometrical points relative to a polyline (“point in poly”). The ultimate goal is to compare the performance of the CUDA version of this programme with the same operations running in parallel on the CPU.

The current status is that I have two different versions of the programme: one uses cudaMemcpyAsync with pinned memory and several streams in order to take advantage of overlapping data transfers and computation. Because this does not seem to work as expected under Windows (while it does under Ubuntu), I have a second version that uses custom copy kernels with pinned memory. Here, overlapping copying and computation works.
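
For context, such a custom copy kernel is essentially just a grid-stride loop reading from one buffer and writing to another, where one of the two pointers is the device-visible alias of a pinned (host-mapped) allocation. A minimal sketch (simplified, not my actual code):

// Minimal sketch of a "custom copy kernel": a grid-stride loop that moves
// elements between a device buffer and a pinned, host-mapped buffer.
// For H2D, src is the device-visible alias of the pinned host buffer and
// dst is a device buffer; for D2H it is the other way round.
template <typename T>
__global__ void copyKernel(const T* src, T* dst, long count)
{
    long idx = blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;

    for (long i = idx; i < count; i += stride)
        dst[i] = src[i];
}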

In this thread, I would like to present test results and what I gathered from them, and to discuss with anybody interested possible ways to further optimize how I call the kernels (grid configuration, etc.) and maybe even the kernels themselves.

To get things started, see my most recent test results:

General test setup

GPU: RTX A2000 Laptop (4 GB VRAM)
No. of SMs: 20
maxThreadsPerSM: 1,536
Test data: 50,000,000 PointXY objects (8 bytes each)
chunksize for transfers/computation: 5,000,000
memory mode: pinned

I ran tests using 2 or 3 streams, each with several different grid configurations. Each run consisted of 10 H2D, 10 ptInPoly and 10 D2H operations (one of each per chunk). It’s worth mentioning that the H2D copies transfer the PointXY test data mentioned above, while the D2H copies transfer the same number of bool values; hence the huge difference in duration between these two categories of operation. For each category I will provide min, max and avg durations, as well as the duration of the whole run and the time taken up by gaps within a stream, i.e. periods during which no operation was taking place in that stream. All durations are taken from Nsight Systems profiles.
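
To make the run structure concrete, the work is issued roughly like this (a simplified sketch reusing the copyKernel sketch from above; streams, numStreams, numBlocks, the three thread counts and the buffer pointers stand in for my actual variables, and hostPoints/hostResults denote the device-visible aliases of the pinned host buffers):

// Simplified sketch of one run: 50,000,000 points are processed in chunks of
// 5,000,000, and each chunk gets its H2D copy, ptInPoly computation and D2H
// copy enqueued on one of the streams in round-robin order.
const long chunkSize = 5000000;
const int numChunks = 10;

for (int chunk = 0; chunk < numChunks; ++chunk) {
    cudaStream_t s = streams[chunk % numStreams];
    long offset = (long)chunk * chunkSize;

    copyKernel<<<numBlocks, threadsH2D, 0, s>>>(hostPoints + offset, devPoints + offset, chunkSize);
    ptInPolyKernel<<<numBlocks, threadsPtInPoly, 0, s>>>(devPoints, chunkSize, devPoly, cPtPoly, devResults, offset);
    copyKernel<<<numBlocks, threadsD2H, 0, s>>>(devResults + offset, hostResults + offset, chunkSize);
}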

Tests using 3 streams, numBlocks = 40

40 x 256 threads (H2D), 40 x 1024 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 7.396 ms
  • max: 21.887 ms
  • avg: 17.455 ms

ptInPoly:

  • min: 2.148 ms
  • max: 5.871 ms
  • avg: 3.789 ms

D2H:

  • min: 2.846 ms
  • max: 10.641 ms
  • avg: 6.307 ms

Overall duration: 109 ms
Gaps in between: 51.4 ms


40 x 384 threads (H2D), 40 x 896 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 6.267 ms
  • max: 16.652 ms
  • avg: 10.965 ms

ptInPoly:

  • min: 2.237 ms
  • max: 11.239 ms
  • avg: 5.727 ms

D2H:

  • min: 0.892 ms
  • max: 15.189 ms
  • avg: 5.71 ms

Overall duration: 102 ms
Gaps in between: 81.976 ms


40 x 512 threads (H2D), 40 x 768 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 7.777 ms
  • max: 15.298 ms
  • avg: 11.634 ms

ptInPoly:

  • min: 2.18 ms
  • max: 10.58 ms
  • avg: 5.705 ms

D2H:

  • min: 1.4 ms
  • max: 15.402 ms
  • avg: 6.849 ms

Overall duration: 107 ms
Gaps in between: 79.116 ms

Tests using 3 streams, numBlocks = 20

20 x 256 threads (H2D), 20 x 1024 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.616 ms
  • max: 23.489 ms
  • avg: 17.209 ms

ptInPoly:

  • min: 2.155 ms
  • max: 8.085 ms
  • avg: 4.68 ms

D2H:

  • min: 1.071 ms
  • max: 10.088 ms
  • avg: 5.181 ms

Overall duration: 103 ms
Gaps in between: 38.299 ms


20 x 384 threads (H2D), 20 x 896 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.144 ms
  • max: 22.997 ms
  • avg: 14.717 ms

ptInPoly:

  • min: 2.245 ms
  • max: 9.608 ms
  • avg: 5.844 ms

D2H:

  • min: 1.101 ms
  • max: 18.573 ms
  • avg: 7.329 ms

Overall duration: 106 ms
Gaps in between: 39.103 ms


20 x 512 threads (H2D), 20 x 768 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.629 ms
  • max: 19.989 ms
  • avg: 13.255 ms

ptInPoly:

  • min: 2.392 ms
  • max: 11.092 ms
  • avg: 7.465 ms

D2H:

  • min: 1.190 ms
  • max: 18.816 ms
  • avg: 7.937 ms

Overall duration: 106 ms
Gaps in between: 31.433 ms


20 x 640 threads (H2D), 20 x 640 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.172 ms
  • max: 17.911 ms
  • avg: 12.571 ms

ptInPoly:

  • min: 2.635 ms
  • max: 12.765 ms
  • avg: 8.469 ms

D2H:

  • min: 1.1 ms
  • max: 18.728 ms
  • avg: 7.338 ms

Overall duration: 107 ms
Gaps in between: 37.213 ms


20 x 640 threads (H2D), 20 x 512 threads (ptInPoly), 20 x 512 threads (D2H):
H2D:

  • min: 8.271 ms
  • max: 18.516 ms
  • avg: 13.179 ms

ptInPoly:

  • min: 3.057 ms
  • max: 17.704 ms
  • avg: 10.4 ms

D2H:

  • min: 2.239 ms
  • max: 8.602 ms
  • avg: 5.636 ms

Overall duration: 116 ms
Gaps in between: 55.846 ms

Tests using 2 streams, numBlocks = 40

40 x 256 threads (H2D), 40 x 1024 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 8.688 ms
  • max: 14.701 ms
  • avg: 11.427 ms

ptInPoly:

  • min: 1.874 ms
  • max: 7.203 ms
  • avg: 4.49 ms

D2H:

  • min: 1.891 ms
  • max: 7.747 ms
  • avg: 5.269 ms

Overall duration: 117 ms
Gaps in between: 22.136 ms


40 x 384 threads (H2D), 40 x 896 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 9.018 ms
  • max: 13.472 ms
  • avg: 10.775 ms

ptInPoly:

  • min: 1.591 ms
  • max: 10.666 ms
  • avg: 5.553 ms

D2H:

  • min: 1.659 ms
  • max: 11.607 ms
  • avg: 6.459 ms

Overall duration: 130 ms
Gaps in between: 32.133 ms


40 x 512 threads (H2D), 40 x 768 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 8.946 ms
  • max: 10.932 ms
  • avg: 10.05 ms

ptInPoly:

  • min: 2.172 ms
  • max: 10.759 ms
  • avg: 4.621 ms

D2H:

  • min: 1.67 ms
  • max: 10.991 ms
  • avg: 6.27 ms

Overall duration: 124 ms
Gaps in between: 38.599 ms


40 x 640 threads (H2D), 40 x 640 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 10.102 ms
  • max: 14.157 ms
  • avg: 10.957 ms

ptInPoly:

  • min: 2.237 ms
  • max: 8.224 ms
  • avg: 5.98 ms

D2H:

  • min: 1.706 ms
  • max: 8.57 ms
  • avg: 5.406 ms

Overall duration: 126 ms
Gaps in between: 28.577 ms


40 x 640 threads (H2D), 40 x 512 threads (ptInPoly), 40 x 512 threads (D2H):
H2D:

  • min: 8.926 ms
  • max: 16.937 ms
  • avg: 11.472 ms

ptInPoly:

  • min: 1.613 ms
  • max: 10.238 ms
  • avg: 5.167 ms

D2H:

  • min: 1.975 ms
  • max: 11.323 ms
  • avg: 5.494 ms

Overall duration: 125 ms
Gaps in between: 28.669 ms

Tests using 2 streams, numBlocks = 20

20 x 256 threads (H2D), 20 x 1024 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 9.262 ms
  • max: 17.312 ms
  • avg: 13.965 ms

ptInPoly:

  • min: 2.158 ms
  • max: 2.719 ms
  • avg: 2.371 ms

D2H:

  • min: 1.282 ms
  • max: 4.793 ms
  • avg: 3.505 ms

Overall duration: 109 ms
Gaps in between: 19.592 ms


20 x 384 threads (H2D), 20 x 896 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 9.062 ms
  • max: 15.265 ms
  • avg: 11.522 ms

ptInPoly:

  • min: 2.244 ms
  • max: 5.639 ms
  • avg: 3.152 ms

D2H:

  • min: 1.245 ms
  • max: 7.771 ms
  • avg: 5.658 ms

Overall duration: 110 ms
Gaps in between: 16.685 ms


20 x 512 threads (H2D), 20 x 768 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.744 ms
  • max: 11.894 ms
  • avg: 9.917 ms

ptInPoly:

  • min: 2.384 ms
  • max: 6.643 ms
  • avg: 5.092 ms

D2H:

  • min: 1.257 ms
  • max: 10.934 ms
  • avg: 5.394 ms

Overall duration: 110 ms
Gaps in between: 15.97 ms


20 x 640 threads (H2D), 20 x 640 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.83 ms
  • max: 11.74 ms
  • avg: 9.454 ms

ptInPoly:

  • min: 2.636 ms
  • max: 9.981 ms
  • avg: 7.607 ms

D2H:

  • min: 1.198 ms
  • max: 11.92 ms
  • avg: 4.315 ms

Overall duration: 114 ms
Gaps in between: 14.236 ms


20 x 640 threads (H2D), 20 x 512 threads (ptInPoly), 20 x 512 threads (D2H):
H2D:

  • min: 8.711 ms
  • max: 15.183 ms
  • avg: 10.667 ms

ptInPoly:

  • min: 2.64 ms
  • max: 6.974 ms
  • avg: 5.674 ms

D2H:

  • min: 1.535 ms
  • max: 8.175 ms
  • avg: 4.769 ms

Overall duration: 114 ms
Gaps in between: 16.902 ms

Observations:
1.) What is striking is how wide the range between minimum and maximum durations is for the H2D and D2H copy operations in many configurations. A look at the Nsight visualizations suggests that a copy takes especially long when it runs at the same time as another copy operation; simultaneously running ptInPoly executions seem to affect the copies far less.
Maybe this hints at some inefficiency in the copy kernels?

2.) Overall, continuous occupation seems to work better (meaning: fewer gaps) with

  • numBlocks = 20 than numBlocks = 40 and
  • using 2 streams than using 3 streams.
    Of course, 3 streams still offer more opportunity for parallelism, which would explain why the best overall durations were achieved using 3 streams.

3.) The range between minimum and maximum durations for the H2D and D2H copy operations narrows notably with 2 streams. While a narrower range might be something to aim for, the overall durations are still better with 3 streams, so I don’t really know whether that would be worthwhile.

So much for the first post in this thread. If you’re interested in discussing this more deeply, feel free to post your comments, suggestions or even requests.


One variant could be to put the zero-copy storing directly into the computation kernel. Storing is easier for the threads to do, as it is (compared to loading) a fire-and-forget operation (whereas loading has to ‘wait’ for the result, at least at the next dependency).
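
For illustration, a minimal sketch of that idea, assuming the result buffer is allocated as mapped pinned memory (numPoints and the variable names are placeholders):

// Sketch: allocate the result buffer as mapped pinned memory and obtain a
// device-side alias for it; stores from the compute kernel then go straight
// to host memory, and the separate D2H step disappears.
bool* hResults = nullptr;    // host-side pointer
bool* dResults = nullptr;    // device-side alias of the same allocation

cudaHostAlloc(&hResults, numPoints * sizeof(bool), cudaHostAllocMapped);
cudaHostGetDevicePointer(&dResults, hResults, 0);

// Pass dResults as the bPtRequestIn argument of the compute kernel.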

Also, the D2H data size is smaller than H2D.
(When storing booleans, make sure they are compressed to bits; one warp-wide store of 128 bytes then writes 1024 booleans with a single instruction.)

By fusing computation and D2H, you can work with 2 streams.
According to your numbers, H2D seems to be slower than compute, so overlapping H2D with the other two could work well.

For reading (H2D) you could try out asynchronous memory copies on the device (see the CUDA C++ Programming Guide) to see whether they improve the transfer rate.
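
A minimal sketch of that pattern with cooperative_groups::memcpy_async, assuming an 8-byte PointXY layout and a placeholder tile size (this only shows the staging, not the actual point-in-poly work):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

struct PointXY { float x; float y; };   // assumed layout (8 bytes)

constexpr int TILE = 1024;              // placeholder tile size

// Each block stages a tile of input points into shared memory asynchronously
// before working on it; on Ampere and newer this can map to cp.async.
__global__ void stagedLoadExample(const PointXY* ptRequest, long iChunkSize)
{
    __shared__ PointXY tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    long tileStart = (long)blockIdx.x * TILE;
    if (tileStart >= iChunkSize) return;

    long tileCount = iChunkSize - tileStart;
    if (tileCount > TILE) tileCount = TILE;

    // All threads of the block cooperate in the asynchronous copy.
    cg::memcpy_async(block, tile, ptRequest + tileStart, sizeof(PointXY) * tileCount);
    cg::wait(block);    // block until the tile has arrived in shared memory

    // ... the point-in-poly test would then read its points from tile[] ...
}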

You mean directly storing the result into host memory within the ptInPoly kernel at the end of each iteration?

…or not after each iteration, but rather after 1024 iterations. Either way - interesting thought!

That’s referring back to the first suggestion in your post, right?

Btw… Can I in any way abstract the findings I made about grid configurations? Meaning: Are they valid exclusively for the very GPU model I used, or can I transfer them to other models, maybe at least the ones with the same Compute Capability?

Yes!

Is it one boolean per warp (32 threads) per iteration?
Regardless, the 128 bytes to store is small enough so that it can be created and buffered in registers (4 bytes per thread) or in shared memory (128 bytes per warp) or even in global device (non-host) memory.

As your results depended on the Windows WDDM driver, it is probably difficult to predict.
I would expect you to get similar results with the same Windows (sub)version and similar GPUs (e.g. Ampere or Ada class; non-data-center, so no A30 or A100). But it is just a guess.

Well, good question.

This is the code of the ptInPoly kernel:
__global__ void ptInPolyKernel(PointXY* ptRequest, long iChunkSize, PointXY* ptPoly, long cPtPoly, bool* bPtRequestIn, int offset) {
    long polySize = cPtPoly;

    // grid-stride loop over the points of the current chunk
    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float UNSCH = 0.01f;   // tolerance ("Unschaerfe")

    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        float dAngle = 0;
        float dAngleTotal = 0;
        bool bContinueOuterLoop = false;

        for (int iPoint = 0; iPoint < polySize; iPoint++)
        {
            // fetch the current line segment
            PointXY ptCurr = ptPoly[iPoint];
            PointXY ptNext = iPoint == polySize - 1 ? ptPoly[0] : ptPoly[iPoint + 1];

            // if the point lies on the line segment, it counts as inside
            if (gpu_PunktGleich(ptRequest[i], ptCurr, UNSCH) || gpu_PunktGleich(ptRequest[i], ptNext, UNSCH) || gpu_PtAufLinieP1P2Unscharf(ptRequest[i], ptCurr, ptNext, UNSCH))
            {
                bPtRequestIn[i] = true;
                bContinueOuterLoop = true;
                continue;   // continues the inner loop; the outer loop is skipped below via bContinueOuterLoop
            }

            // accumulate the winding angle over all segments
            dAngle = gpu_winkel3Punkte(ptRequest[i], ptCurr, ptNext, UNSCH);
            dAngleTotal += dAngle;
        }

        if (bContinueOuterLoop) {
            continue;
        }

        // a non-zero total winding angle means the point is inside the polyline
        if (abs(dAngleTotal) > 0.1)
        {
            bPtRequestIn[i] = true;
        }
        else {
            bPtRequestIn[i] = false;
        }
    }
}

So… As I understand it, each block of threads covers one complete run of the outer for-loop, and each individual thread covers those iterations of the outer loop defined by the stride pattern, right? Either way, it’s one boolean per iteration, since one iteration refers to one PointXY object. How exactly the individual warps come into this, I’m not sure…

So, I would add a second, similar stride loop at the end of the kernel function, pushing all calculated booleans to host memory, right?

Huh… That means, in order to get maximum performance, you’d have to test each grid configuration on the actual target system? That would not be feasible at all for software that’s supposed to go to end users, because you’d never know the exact hardware and software setup beforehand. I was hoping the Compute Capability would at least offer some guidance concerning the number of SMs, max blocks per SM, max threads per SM and such.

I think you get one boolean per i and each thread (including each thread of a warp) has a different i.
(Or several different i with the stride distance.)

With __ballot_sync() you can combine the 32 boolean results of 32 threads of a warp into one 32-bit unsigned int.

You can store the 32-bit integers into shared memory until you have 32 of them, and then send those 32 unsigned ints (= 128 bytes = 1024 bits) to host memory with one warp-wide store. You can also do it differently; that is just one example. It also depends on how the bits are packed or combined in the output memory, and on which thread is responsible for which input data. Possibly you have to modify the stride.
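
A sketch of the warp-level part of that, assuming a full warp is active and lane k of the warp handled point warpBase + k (names are placeholders):

// Pack the boolean results of one warp into a 32-bit mask with __ballot_sync,
// so a single lane writes 32 results with one 4-byte store.
__device__ void storePacked(unsigned int* packedOut, long warpBase, bool inPoly)
{
    unsigned int mask = __ballot_sync(0xffffffffu, inPoly);   // one bit per lane

    if ((threadIdx.x & 31) == 0)        // one lane per warp performs the store
        packedOut[warpBase / 32] = mask;
}

Buffering 32 such masks in shared memory and flushing them together would then give the 128 bytes per warp mentioned above; the host code reading the results has to be adapted to the packed format accordingly.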

As I said, I believe that it is similar with Ampere (same 8.6) or Ada class (8.9, same major version) GPUs.
As Windows is (probably) responsible for limiting the concurrent stream execution with WDDM on the software side, it can depend on the Windows sub-version, and probably less on the hardware side.

The different GPUs of one compute capability have the same threads per SM, but they can have a different number of SMs. So you should query the number of SMs and set the number of blocks accordingly.
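
For example (a minimal sketch):

// Query the SM count of the installed GPU at runtime and derive the grid
// size from it instead of hard-coding 20 or 40 blocks.
int device = 0;
cudaGetDevice(&device);

int numSMs = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int numBlocks = 2 * numSMs;    // e.g. two blocks per SM (40 on the RTX A2000 Laptop)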
