This thread is somewhat of a continuation of the discussion started here. As the original question of the linked thread (“Why doesn’t overlapping work here?”) was answered a while ago, I suppose it’s reasonable to switch to a new thread.
So, for anybody new coming in, I’ll quickly sum up what this is about:
I’m exploring the workflow and benefits of CUDA for my company. To do that, I’m developing a proof-of-concept program that calculates the position of millions of geometrical points relative to a polyline (“point in poly”). The ultimate goal is to compare the performance of the CUDA version of this program with the same operations running in parallel on the CPU.
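For context, the ptInPoly kernel that shows up in the results below essentially performs a crossing-number inside/outside check of each point against the polyline (treated as closed). A simplified sketch of what such a kernel can look like (illustrative only, not my exact code; PointXY is just the 8-byte struct mentioned in the test setup):

```cpp
// Simplified sketch of a crossing-number point-in-polygon kernel
// (illustrative; not the exact production kernel).
struct PointXY { float x, y; };   // 8 bytes per point, as in the test data

__global__ void ptInPoly(const PointXY* __restrict__ pts, int numPts,
                         const PointXY* __restrict__ poly, int numVerts,
                         bool* __restrict__ results)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (; i < numPts; i += stride) {              // grid-stride loop over all points
        const PointXY p = pts[i];
        bool inside = false;
        for (int v = 0, w = numVerts - 1; v < numVerts; w = v++) {
            const PointXY a = poly[v];
            const PointXY b = poly[w];
            // Count how many segments a horizontal ray starting at p crosses.
            if (((a.y > p.y) != (b.y > p.y)) &&
                (p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x))
                inside = !inside;
        }
        results[i] = inside;
    }
}
```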
The current status is that I have two different versions of the program: one uses cudaMemcpyAsync with pinned memory and several streams in order to overlap data transfers and computation. Because this does not work as expected under Windows (while it does under Ubuntu), I have a second version that uses custom copy kernels with pinned memory; there, overlapping copying and computation works.
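To make the structure concrete, the cudaMemcpyAsync variant follows the usual chunked pipeline, roughly like this (simplified sketch; buffer and variable names are illustrative, not my actual code):

```cpp
// Sketch of the cudaMemcpyAsync variant: each chunk is copied in, processed
// and copied back on one of several streams so transfers and compute can overlap.
const int chunkSize = 5000000;                    // 5,000,000 points per chunk
const int numChunks = numPoints / chunkSize;      // 50,000,000 / 5,000,000 = 10

for (int c = 0; c < numChunks; ++c) {
    cudaStream_t s = streams[c % numStreams];
    size_t off = (size_t)c * chunkSize;

    cudaMemcpyAsync(dPts + off, hPinnedPts + off,          // H2D: PointXY data
                    chunkSize * sizeof(PointXY),
                    cudaMemcpyHostToDevice, s);

    ptInPoly<<<numBlocks, threadsPerBlock, 0, s>>>(        // compute on the same stream
        dPts + off, chunkSize, dPoly, numVerts, dResults + off);

    cudaMemcpyAsync(hPinnedResults + off, dResults + off,  // D2H: one bool per point
                    chunkSize * sizeof(bool),
                    cudaMemcpyDeviceToHost, s);
}
cudaDeviceSynchronize();
```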
In this thread, I would like to present test results and what I gathered from them, and to discuss with anybody interested possible ways to further optimize how I call the kernels (grid configuration, etc.) and maybe even the kernels themselves.
To get things started, see my most recent test results:
General test setup
GPU: RTX A2000 Laptop (4 GB VRAM)
No. of SMs: 20
maxThreadsPerSM: 1,536
Test data: 50,000,000 PointXY objects (8 bytes each)
chunksize for transfers/computation: 5,000,000
memory mode: pinned
I ran tests using 2 or 3 streams with several different grid configurations each. Each run consisted of 10 H2D, ptInPoly and D2H operations. It’s worth mentioning that H2D copying concerns the above-mentioned amount of test data, while D2H comprises the same number of bool values; hence the large difference in duration between these two categories of operation. I will provide min, max and avg values for these categories of operations, as well as the duration of the whole run and the time taken up by gaps within a stream, meaning periods where no operation was taking place. All durations are taken from Nsight Systems profiling.
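For reference, the custom copy kernels mentioned above are plain grid-stride kernels that read from the pinned host buffers directly instead of going through cudaMemcpyAsync. A minimal sketch, assuming the pinned memory is visible to the device (e.g. allocated with cudaHostAllocMapped and accessed via the pointer from cudaHostGetDevicePointer); names are illustrative:

```cpp
// Grid-stride copy kernel used in place of cudaMemcpyAsync for the H2D direction:
// src points into pinned, device-mapped host memory, dst into device memory.
// The D2H kernel looks the same with the roles of src and dst swapped
// (and bool instead of PointXY).
__global__ void copyKernelH2D(const PointXY* __restrict__ src,
                              PointXY* __restrict__ dst,
                              size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < count; i += stride)
        dst[i] = src[i];    // one 8-byte PointXY per iteration
}

// Launched per chunk on that chunk's stream, e.g. with one of the configurations
// from the tests below:
// copyKernelH2D<<<40, 256, 0, stream>>>(hMappedPts + off, dPts + off, chunkSize);
```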
Tests using 3 streams, numBlocks = 40
40 x 256 threads (H2D), 40 x 1024 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 7.396 ms
- max: 21.887 ms
- avg: 17.455 ms
ptInPoly:
- min: 2.148 ms
- max: 5.871 ms
- avg: 3.789 ms
D2H:
- min: 2.846 ms
- max: 10.641 ms
- avg: 6.307 ms
Overall duration: 109 ms
Gaps in between: 51.4 ms
40 x 384 threads (H2D), 40 x 896 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 6.267 ms
- max: 16.652 ms
- avg: 10.965 ms
ptInPoly:
- min: 2.237 ms
- max: 11.239 ms
- avg: 5.727 ms
D2H:
- min: 0.892 ms
- max: 15.189 ms
- avg: 5.71 ms
Overall duration: 102 ms
Gaps in between: 81.976 ms
40 x 512 threads (H2D), 40 x 768 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 7.777 ms
- max: 15.298 ms
- avg: 11.634 ms
ptInPoly:
- min: 2.18 ms
- max: 10.58 ms
- avg: 5.705 ms
D2H:
- min: 1.4 ms
- max: 15.402 ms
- avg: 6.849 ms
Overall duration: 107 ms
Gaps in between: 79.116 ms
Tests using 3 streams, numBlocks = 20
20 x 256 threads (H2D), 20 x 1024 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 8.616 ms
- max: 23.489 ms
- avg: 17.209 ms
ptInPoly:
- min: 2.155 ms
- max: 8.085 ms
- avg: 4.68 ms
D2H:
- min: 1.071 ms
- max: 10.088 ms
- avg: 5.181 ms
Overall duration: 103 ms
Gaps in between: 38.299 ms
20 x 384 threads (H2D), 20 x 896 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 8.144 ms
- max: 22.997 ms
- avg: 14.717 ms
ptInPoly:
- min: 2.245 ms
- max: 9.608 ms
- avg: 5.844 ms
D2H:
- min: 1.101 ms
- max: 18.573 ms
- avg: 7.329 ms
Overall duration: 106 ms
Gaps in between: 39.103 ms
20 x 512 threads (H2D), 20 x 768 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 8.629 ms
- max: 19.989 ms
- avg: 13.255 ms
ptInPoly:
- min: 2.392 ms
- max: 11.092 ms
- avg: 7.465 ms
D2H:
- min: 1.190 ms
- max: 18.816 ms
- avg: 7.937 ms
Overall duration: 106 ms
Gaps in between: 31.433 ms
20 x 640 threads (H2D), 20 x 640 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 8.172 ms
- max: 17.911 ms
- avg: 12.571 ms
ptInPoly:
- min: 2.635 ms
- max: 12.765 ms
- avg: 8.469 ms
D2H:
- min: 1.1 ms
- max: 18.728 ms
- avg: 7.338 ms
Overall duration: 107 ms
Gaps in between: 37.213 ms
20 x 640 threads (H2D), 20 x 512 threads (ptInPoly), 20 x 512 threads (D2H):
H2D:
- min: 8.271 ms
- max: 18.516 ms
- avg: 13.179 ms
ptInPoly:
- min: 3.057 ms
- max: 17.704 ms
- avg: 10.4 ms
D2H:
- min: 2.239 ms
- max: 8.602 ms
- avg: 5.636 ms
Overall duration: 116 ms
Gaps in between: 55.846 ms
Tests using 2 streams, numBlocks = 40
40 x 256 threads (H2D), 40 x 1024 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 8.688 ms
- max: 14.701 ms
- avg: 11.427 ms
ptInPoly:
- min: 1.874 ms
- max: 7.203 ms
- avg: 4.49 ms
D2H:
- min: 1.891 ms
- max: 7.747 ms
- avg: 5.269 ms
Overall duration: 117 ms
Gaps in between: 22.136 ms
40 x 384 threads (H2D), 40 x 896 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 9.018 ms
- max: 13.472 ms
- avg: 10.775 ms
ptInPoly:
- min: 1.591 ms
- max: 10.666 ms
- avg: 5.553 ms
D2H:
- min: 1.659 ms
- max: 11.607 ms
- avg: 6.459 ms
Overall duration: 130 ms
Gaps in between: 32.133 ms
40 x 512 threads (H2D), 40 x 768 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 8.946 ms
- max: 10.932 ms
- avg: 10.05 ms
ptInPoly:
- min: 2.172 ms
- max: 10.759 ms
- avg: 4.621 ms
D2H:
- min: 1.67 ms
- max: 10.991 ms
- avg: 6.27 ms
Overall duration: 124 ms
Gaps in between: 38.599 ms
40 x 640 threads (H2D), 40 x 640 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:
- min: 10.102 ms
- max: 14.157 ms
- avg: 10.957 ms
ptInPoly:
- min: 2.237 ms
- max: 8.224 ms
- avg: 5.98 ms
D2H:
- min: 1.706 ms
- max: 8.57 ms
- avg: 5.406 ms
Overall duration: 126 ms
Gaps in between: 28.577 ms
40 x 640 threads (H2D), 40 x 512 threads (ptInPoly), 40 x 512 threads (D2H):
H2D:
- min: 8.926 ms
- max: 16.937 ms
- avg: 11.472 ms
ptInPoly:
- min: 1.613 ms
- max: 10.238 ms
- avg: 5.167 ms
D2H:
- min: 1.975 ms
- max: 11.323 ms
- avg: 5.494 ms
Overall duration: 125 ms
Gaps in between: 28.669 ms
Tests using 2 streams, numBlocks = 20
20 x 256 threads (H2D), 20 x 1024 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 9.262 ms
- max: 17.312 ms
- avg: 13.965 ms
ptInPoly:
- min: 2.158 ms
- max: 2.719 ms
- avg: 2.371 ms
D2H:
- min: 1.282 ms
- max: 4.793 ms
- avg: 3.505 ms
Overall duration: 109 ms
Gaps in between: 19.592 ms
20 x 384 threads (H2D), 20 x 896 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 9.062 ms
- max: 15.265 ms
- avg: 11.522 ms
ptInPoly:
- min: 2.244 ms
- max: 5.639 ms
- avg: 3.152 ms
D2H:
- min: 1.245 ms
- max: 7.771 ms
- avg: 5.658 ms
Overall duration: 110 ms
Gaps in between: 16.685 ms
20 x 512 threads (H2D), 20 x 768 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 8.744 ms
- max: 11.894 ms
- avg: 9.917 ms
ptInPoly:
- min: 2.384 ms
- max: 6.643 ms
- avg: 5.092 ms
D2H:
- min: 1.257 ms
- max: 10.934 ms
- avg: 5.394 ms
Overall duration: 110 ms
Gaps in between: 15.97 ms
20 x 640 threads (H2D), 20 x 640 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:
- min: 8.83 ms
- max: 11.74 ms
- avg: 9.454 ms
ptInPoly:
- min: 2.636 ms
- max: 9.981 ms
- avg: 7.607 ms
D2H:
- min: 1.198 ms
- max: 11.92 ms
- avg: 4.315 ms
Overall duration: 114 ms
Gaps in between: 14.236 ms
20 x 640 threads (H2D), 20 x 512 threads (ptInPoly), 20 x 512 threads (D2H):
H2D:
- min: 8.711 ms
- max: 15.183 ms
- avg: 10.667 ms
ptInPoly:
- min: 2.64 ms
- max: 6.974 ms
- avg: 5.674 ms
D2H:
- min: 1.535 ms
- max: 8.175 ms
- avg: 4.769 ms
Overall duration: 114 ms
Gaps in between: 16.902 ms
Observations:
1.) Striking is how wide the range between minimum and maximum durations is for the H2D and D2H copying operations in many configurations. A look at the Nsight visualizations suggests that copying takes especially long when it takes place at the same time as a second copying operation; simultaneously occurring ptInPoly executions seem to affect the copying far less.
Maybe this hints at some lack of efficiency in the copy kernels?
2.) Overall, continuous occupation seems to work better (meaning: fewer gaps) with numBlocks = 20 than with numBlocks = 40, and with 2 streams than with 3 streams.
Of course, using 3 streams there’s still more opportunity for parallelism, which would be why the best overall durations were achieved using 3 streams.
3.) The range between minimum and maximum durations for H2D and D2H copying operations narrows notably when using 2 streams. While a narrower range could be something to aim for, overall durations are still better with 3 streams, so I don’t really know whether that would be worthwhile.
So much for the first post in this thread. If you’re interested in discussing this more deeply, feel free to post your comments, suggestions or even requests.