Possibilities to further optimize PoC programme using custom copy kernels

This thread is somewhat of a continuation of the discussion started here. As the original question of the linked thread (“Why doesn’t overlapping work here?”) was answered a while ago, I suppose it’s reasonable to switch to a new thread.

So, for anybody new coming in, I’ll quickly sum up what this is about:
I’m exploring the workflow and benefits of CUDA for my company. To do that, I’m developing a proof-of-concept programme that calculates the position of several million geometrical points relative to a polyline (“point in poly”). The ultimate goal is to compare the performance of the CUDA version of this programme with the same operations running in parallel on the CPU.

The current status is that I have two different versions of the programme: One is using cudaMemcpyAsync with pinned memory and several streams in order to take advantage of overlapping data transfers and computation. Because this does not seem to work as expected under Windows (while it does under Ubuntu), I have a second version, using custom copy kernels with pinned memory. Here, overlapping copying and computation works.

In this thread, I would like to present my test results and what I gathered from them, and to discuss with anybody interested possible ways to further optimize how I call the kernels (grid configuration, etc.) and maybe even the kernels themselves.

To get things started, see my most recent test results:

General test setup

GPU: RTX A2000 Laptop (4 GB VRAM)
No. of SMs: 20
maxThreadsPerSM: 1,536
Test data: 50,000,000 PointXY objects (8 bytes each)
chunksize for transfers/computation: 5,000,000
memory mode: pinned

I ran tests using 2 or 3 streams with several different grid configurations each. Each run consisted of 10 H2D, ptInPoly and D2H operations. It’s worth mentioning that each H2D copy transfers a chunk of the PointXY test data mentioned above, while D2H transfers the same number of bool values, hence the huge difference in duration between these categories of operation. I will provide min, max and avg values for these categories of operations as well as the duration of the whole run and the time taken up by gaps within a stream, meaning periods in which no operation was taking place. All durations are taken from Nsight Systems profiles.

Tests using 3 streams, numBlocks = 40

40 x 256 threads (H2D), 40 x 1024 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 7.396 ms
  • max: 21.887 ms
  • avg: 17.455 ms

ptInPoly:

  • min: 2.148 ms
  • max: 5.871 ms
  • avg: 3.789 ms

D2H:

  • min: 2.846 ms
  • max: 10.641 ms
  • avg: 6.307 ms

Overall duration: 109 ms
Gaps in between: 51.4 ms


40 x 384 threads (H2D), 40 x 896 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 6.267 ms
  • max: 16.652 ms
  • avg: 10.965 ms

ptInPoly:

  • min: 2.237 ms
  • max: 11.239 ms
  • avg: 5.727 ms

D2H:

  • min: 0.892 ms
  • max: 15.189 ms
  • avg: 5.71 ms

Overall duration: 102 ms
Gaps in between: 81.976 ms


40 x 512 threads (H2D), 40 x 768 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 7.777 ms
  • max: 15.298 ms
  • avg: 11.634 ms

ptInPoly:

  • min: 2.18 ms
  • max: 10.58 ms
  • avg: 5.705 ms

D2H:

  • min: 1.4 ms
  • max: 15.402 ms
  • avg: 6.849 ms

Overall duration: 107 ms
Gaps in between: 79.116 ms

Tests using 3 streams, numBlocks = 20

20 x 256 threads (H2D), 20 x 1024 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.616 ms
  • max: 23.489 ms
  • avg: 17.209 ms

ptInPoly:

  • min: 2.155 ms
  • max: 8.085 ms
  • avg: 4.68 ms

D2H:

  • min: 1.071 ms
  • max: 10.088 ms
  • avg: 5.181 ms

Overall duration: 103 ms
Gaps in between: 38.299 ms


20 x 384 threads (H2D), 20 x 896 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.144 ms
  • max: 22.997 ms
  • avg: 14.717 ms

ptInPoly:

  • min: 2.245 ms
  • max: 9.608 ms
  • avg: 5.844 ms

D2H:

  • min: 1.101 ms
  • max: 18.573 ms
  • avg: 7.329 ms

Overall duration: 106 ms
Gaps in between: 39.103 ms


20 x 512 threads (H2D), 20 x 768 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.629 ms
  • max: 19.989 ms
  • avg: 13.255 ms

ptInPoly:

  • min: 2.392 ms
  • max: 11.092 ms
  • avg: 7.465 ms

D2H:

  • min: 1.190 ms
  • max: 18.816 ms
  • avg: 7.937 ms

Overall duration: 106 ms
Gaps in between: 31.433 ms


20 x 640 threads (H2D), 20 x 640 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.172 ms
  • max: 17.911 ms
  • avg: 12.571 ms

ptInPoly:

  • min: 2.635 ms
  • max: 12.765 ms
  • avg: 8.469 ms

D2H:

  • min: 1.1 ms
  • max: 18.728 ms
  • avg: 7.338 ms

Overall duration: 107 ms
Gaps in between: 37.213 ms


20 x 640 threads (H2D), 20 x 512 threads (ptInPoly), 20 x 512 threads (D2H):
H2D:

  • min: 8.271 ms
  • max: 18.516 ms
  • avg: 13.179 ms

ptInPoly:

  • min: 3.057 ms
  • max: 17.704 ms
  • avg: 10.4 ms

D2H:

  • min: 2.239 ms
  • max: 8.602 ms
  • avg: 5.636 ms

Overall duration: 116 ms
Gaps in between: 55.846 ms

Tests using 2 streams, numBlocks = 40

40 x 256 threads (H2D), 40 x 1024 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 8.688 ms
  • max: 14.701 ms
  • avg: 11.427 ms

ptInPoly:

  • min: 1.874 ms
  • max: 7.203 ms
  • avg: 4.49 ms

D2H:

  • min: 1.891 ms
  • max: 7.747 ms
  • avg: 5.269 ms

Overall duration: 117 ms
Gaps in between: 22.136 ms


40 x 384 threads (H2D), 40 x 896 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 9.018 ms
  • max: 13.472 ms
  • avg: 10.775 ms

ptInPoly:

  • min: 1.591 ms
  • max: 10.666 ms
  • avg: 5.553 ms

D2H:

  • min: 1.659 ms
  • max: 11.607 ms
  • avg: 6.459 ms

Overall duration: 130 ms
Gaps in between: 32.133 ms


40 x 512 threads (H2D), 40 x 768 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 8.946 ms
  • max: 10.932 ms
  • avg: 10.05 ms

ptInPoly:

  • min: 2.172 ms
  • max: 10.759 ms
  • avg: 4.621 ms

D2H:

  • min: 1.67 ms
  • max: 10.991 ms
  • avg: 6.27 ms

Overall duration: 124 ms
Gaps in between: 38.599 ms


40 x 640 threads (H2D), 40 x 640 threads (ptInPoly), 40 x 256 threads (D2H):
H2D:

  • min: 10.102 ms
  • max: 14.157 ms
  • avg: 10.957 ms

ptInPoly:

  • min: 2.237 ms
  • max: 8.224 ms
  • avg: 5.98 ms

D2H:

  • min: 1.706 ms
  • max: 8.57 ms
  • avg: 5.406 ms

Overall duration: 126 ms
Gaps in between: 28.577 ms


40 x 640 threads (H2D), 40 x 512 threads (ptInPoly), 40 x 512 threads (D2H):
H2D:

  • min: 8.926 ms
  • max: 16.937 ms
  • avg: 11.472 ms

ptInPoly:

  • min: 1.613 ms
  • max: 10.238 ms
  • avg: 5.167 ms

D2H:

  • min: 1.975 ms
  • max: 11.323 ms
  • avg: 5.494 ms

Overall duration: 125 ms
Gaps in between: 28.669 ms

Tests using 2 streams, numBlocks = 20

20 x 256 threads (H2D), 20 x 1024 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 9.262 ms
  • max: 17.312 ms
  • avg: 13.965 ms

ptInPoly:

  • min: 2.158 ms
  • max: 2.719 ms
  • avg: 2.371 ms

D2H:

  • min: 1.282 ms
  • max: 4.793 ms
  • avg: 3.505 ms

Overall duration: 109 ms
Gaps in between: 19.592 ms


20 x 384 threads (H2D), 20 x 896 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 9.062 ms
  • max: 15.265 ms
  • avg: 11.522 ms

ptInPoly:

  • min: 2.244 ms
  • max: 5.639 ms
  • avg: 3.152 ms

D2H:

  • min: 1.245 ms
  • max: 7.771 ms
  • avg: 5.658 ms

Overall duration: 110 ms
Gaps in between: 16.685 ms


20 x 512 threads (H2D), 20 x 768 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.744 ms
  • max: 11.894 ms
  • avg: 9.917 ms

ptInPoly:

  • min: 2.384 ms
  • max: 6.643 ms
  • avg: 5.092 ms

D2H:

  • min: 1.257 ms
  • max: 10.934 ms
  • avg: 5.394 ms

Overall duration: 110 ms
Gaps in between: 15.97 ms


20 x 640 threads (H2D), 20 x 640 threads (ptInPoly), 20 x 256 threads (D2H):
H2D:

  • min: 8.83 ms
  • max: 11.74 ms
  • avg: 9.454 ms

ptInPoly:

  • min: 2.636 ms
  • max: 9.981 ms
  • avg: 7.607 ms

D2H:

  • min: 1.198 ms
  • max: 11.92 ms
  • avg: 4.315 ms

Overall duration: 114 ms
Gaps in between: 14.236 ms


20 x 640 threads (H2D), 20 x 512 threads (ptInPoly), 20 x 512 threads (D2H):
H2D:

  • min: 8.711 ms
  • max: 15.183 ms
  • avg: 10.667 ms

ptInPoly:

  • min: 2.64 ms
  • max: 6.974 ms
  • avg: 5.674 ms

D2H:

  • min: 1.535 ms
  • max: 8.175 ms
  • avg: 4.769 ms

Overall duration: 114 ms
Gaps in between: 16.902 ms

Observations:
1.) Striking is how wide the range between minimum and maximum durations is for the H2D and D2H copy operations in many configurations. A look at the Nsight visualizations suggests that copying takes especially long when it takes place at the same time as a second copy operation. Simultaneously occurring ptInPoly executions seem to affect the copies far less.
Maybe this hints at some lack of efficiency in the copy kernels?

2.) Overall, continuous occupation seems to work better (meaning: fewer gaps) with

  • numBlocks = 20 than numBlocks = 40 and
  • using 2 streams than using 3 streams.
    Of course, using 3 streams, there’s still more opportunity for parallelism, which would be why the best overall durations were achieved using 3 streams.

3.) The range between minimum and maximum durations for the H2D and D2H copy operations narrows notably with 2 streams. While a tighter range could be something to aim for, overall durations are still better with 3 streams, so I don’t really know whether that would be worthwhile.

So much for the first post in this thread. If you’re interested in discussing this more deeply, feel free to post your comments, suggestions or even requests.


One variant could be to put the zero-copy storing directly into the computation kernel. Storing is easier for the threads to do, as it is (compared to loading) a fire-and-forget operation (whereas loading has to ‘wait’ for the result, at least at the next dependency).

Also the data size of D2H is smaller than H2D.
(When storing booleans, make sure that they are packed: one store instruction can write 128 bytes per warp, which is 1024 booleans per warp if they are packed one bit each.)

By fusing computation and D2H, you can work with 2 streams.
H2D according to your numbers seems to be slower than compute. So overlapping H2D with the other two could work well.

For reading (H2D) you could try out asynchronous memory copies on the device (1. Introduction — CUDA C++ Programming Guide) to see whether they improve the transfer rate.
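A minimal sketch of how that could look, using cooperative_groups::memcpy_async (CUDA 11+) to stage the pinned host data through shared memory before writing it out to device global memory. It assumes the cudaMallocHost pointer is directly usable on the device (which it is with unified virtual addressing); the kernel name, the TILE size and the raw byte staging buffer are just illustrative choices, and whether this actually beats your plain stride loop has to be measured:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void cpyToDeviceAsyncKernel(const PointXY* __restrict__ ptRequestHost,
                                       PointXY* __restrict__ ptRequestDevice,
                                       long iChunkSize, int offset) {
    constexpr int TILE = 1024;                           // elements staged per block and step
    __shared__ __align__(8) unsigned char smem[TILE * sizeof(PointXY)];
    PointXY* tile = reinterpret_cast<PointXY*>(smem);
    cg::thread_block block = cg::this_thread_block();

    // each block walks over its part of the chunk in TILE-sized pieces
    for (long base = offset + (long)blockIdx.x * TILE;
         base < offset + iChunkSize;
         base += (long)gridDim.x * TILE) {
        long count = offset + iChunkSize - base;
        if (count > TILE) count = TILE;

        // asynchronous, block-cooperative copy: pinned host memory -> shared memory
        cg::memcpy_async(block, smem,
                         reinterpret_cast<const unsigned char*>(ptRequestHost + base),
                         count * sizeof(PointXY));
        cg::wait(block);                                 // wait until the staged data has arrived

        // coalesced writes: shared memory -> device global memory
        for (long j = threadIdx.x; j < count; j += blockDim.x)
            ptRequestDevice[base + j] = tile[j];
        block.sync();                                    // smem is reused in the next iteration
    }
}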


You mean directly storing the result into host memory within the ptInPoly kernel at the end of each iteration?

…or not after each iteration, but rather after 1024 iterations. Either way - interesting thought!

That’s referring back to the first suggestion in your post, right?

Btw… Can I in any way abstract the findings I made about grid configurations? Meaning: Are they valid exclusively for the very GPU model I used, or can I transfer them to other models, maybe at least the ones with the same Compute Capability?

Yes!

Is it one boolean per warp (32 threads) per iteration?
Regardless, the 128 bytes to store is small enough so that it can be created and buffered in registers (4 bytes per thread) or in shared memory (128 bytes per warp) or even in global device (non-host) memory.

As they were dependent on the WDDM Windows driver, it is probably difficult to predict.
I would expect you to get similar results with the same Windows (sub)version with similar GPUs (e.g. Ampere or with Ada class (non data-center; no A30 or A100)). But it is just a guess.

Well, good question.

This is the code of the ptInPoly kernel
__global__ void ptInPolyKernel(PointXY* ptRequest, long iChunkSize, PointXY* ptPoly, long cPtPoly, bool* bPtRequestIn, int offset) {
    long polySize = cPtPoly;

    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float UNSCH = 0.01;

    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        float dAngle = 0;
        float dAngleTotal = 0;
        bool bContinueOuterLoop = false;

        for (int iPoint = 0; iPoint < polySize; iPoint++)
        {
            // fetch the current line segment
            PointXY ptCurr = ptPoly[iPoint];
            PointXY ptNext = iPoint == polySize - 1 ? ptPoly[0] : ptPoly[iPoint + 1];

            // if the point lies on the line
            if (gpu_PunktGleich(ptRequest[i], ptCurr, UNSCH) || gpu_PunktGleich(ptRequest[i], ptNext, UNSCH) || gpu_PtAufLinieP1P2Unscharf(ptRequest[i], ptCurr, ptNext, UNSCH))
            {
                bPtRequestIn[i] = true;
                bContinueOuterLoop = true;
                continue;
            }

            dAngle = gpu_winkel3Punkte(ptRequest[i], ptCurr, ptNext, UNSCH);
            dAngleTotal += dAngle;
        }

        if (bContinueOuterLoop) {
            continue;
        }

        if (abs(dAngleTotal) > 0.1)
        {
            bPtRequestIn[i] = true;
        }
        else {
            bPtRequestIn[i] = false;
        }
    } 
}

So… I understand that each block of threads covers one complete outer for-loop, and each single thread covers those iterations of the outer for-loop defined by the stride pattern, right? Either way, it’s one boolean per iteration, as one iteration refers to one PointXY object. However the individual warps come into this…

So, I would add a second, similar stride loop at the end of the kernel function, pushing all calculated booleans to host memory, right?

Huh… That means, in order to get maximum performance, you’d have to test each grid configuration on the actual target system? This would not at all be feasible for a software that’s supposed to go to end users because you’d never know the exact hardware and software setup beforehand. I was hoping the Compute Capability would at least offer some guidance concerning the number of SMs, max blocks per SM, max threads per SM and such.

I think you get one boolean per i and each thread (including each thread of a warp) has a different i.
(Or several different i with the stride distance.)

With __ballot_sync() you can combine the 32 boolean results of 32 threads of a warp into one 32-bit unsigned int.

You can store the 32-bit integer in shared memory until you have 32 of those, and then send those 32 unsigned ints (= 128 bytes = 1024 bits) to host memory with one instruction. You can also do it differently; that is just one example. It also depends on how you want to pack or combine the bits in the output memory, and on which thread is responsible for which input data. Possibly you have to modify the stride.
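A small sketch of just the packing step (leaving out the shared-memory staging and the 128-byte store for brevity). The kernel name, the packedResults buffer and the dummy predicate are placeholders, not from your code, and it assumes offset and iChunkSize are multiples of 32:

__global__ void ptInPolyPackedKernel(const PointXY* ptRequest, long iChunkSize,
                                     unsigned int* packedResults, int offset) {
    long i = offset + (long)blockIdx.x * blockDim.x + threadIdx.x;

    // every lane of the warp computes its own boolean (no early return,
    // so all 32 lanes participate in the ballot below)
    bool inside = false;
    if (i < offset + iChunkSize)
        inside = (ptRequest[i].coordinates.x > 0.0f);  // placeholder predicate instead of the real point-in-poly test

    // bit k of 'mask' now holds the result of lane k of this warp
    unsigned int mask = __ballot_sync(0xFFFFFFFFu, inside);

    // lane 0 stores 32 results with a single 4-byte write
    if ((threadIdx.x & 31) == 0 && i < offset + iChunkSize)
        packedResults[i / 32] = mask;
}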

As I said, I believe that it is similar with Ampere (same 8.6) or Ada class (8.9, same major version) GPUs.
As Windows is (probably) responsible for limiting the concurrent stream execution with WDDM on the software side, it can depend on the Windows sub-version, and probably less on the hardware side.

The different GPUs of one compute capability have the same threads per SM, but they can have a different number of SMs. So you should query the number of SMs and set the number of blocks accordingly.
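For example (a minimal host-side sketch; device 0 is assumed):

cudaDeviceProp devProps;
cudaGetDeviceProperties(&devProps, 0);                      // device 0 assumed
int numSMs = devProps.multiProcessorCount;                  // e.g. 20 on your RTX A2000 Laptop
int maxThreadsPerSM = devProps.maxThreadsPerMultiProcessor; // 1536 on compute capability 8.6
int numBlocks = numSMs;                                     // or a small multiple of it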


Sooo, finally, I’m back working on this project.

Before I dive deeper into the optimization opportunities you posted lastly, I wanted to finally make the whole code portable to Linux for testing purposes. While most of it already was cross-platform compatible, especially the multi-threaded execution on the CPU was still based on the Microsoft-specific parallel_for()-loop.

Anyway. I got it running with OpenMP on Windows and Linux. BUT: When testing around under Ubuntu 24.04, I got CUDA’s invalid argument error message for the following lines of code (abbreviated to the essential):

erroneous code
PointXY* ptRequest = nullptr;
bool* bPtRequestIn = nullptr;

cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY));
cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool));

Now, after a surprisingly helpful conversation with ChatGPT, I suppose that the error lies in the way my PointXY class is built:

PointXY
__host__ __device__ class PointXY {
    // Properties
public:
    float2 coordinates;

    __host__ __device__ PointXY(float x, float y) {
        this->coordinates.x = x;
        this->coordinates.y = y;
    }

    __host__ __device__ PointXY(float2 coordinates) {
        this->coordinates.x = coordinates.x;
        this->coordinates.y = coordinates.y;
    }

    PointXY() {
        //
    }

    // Getters
    __host__ __device__ float getX() {
        return this->coordinates.x;
    }

    __host__ __device__ float getY() {
        return this->coordinates.y;
    }

    __host__ __device__ float2 getCoordinates() {
        return this->coordinates;
    }
};

ChatGPT tells me that allocating memory with cudaMallocHost() for classes that contain custom constructors or functions does not follow Nvidia’s recommendations in the CUDA Programming Guide because these are not just plain old data. ChatGPT further lets me know that the very same code might (and does indeed) run flawlessly under Windows because the MSVC compiler and the OS itself are often less strict in applying such rules than e.g. g++ is.

As my PointXY class only uses constructors and getter functions, I then asked ChatGPT whether changing my code to use a struct containing just one float2 member would help, and whether doing so would still ensure that my custom copy kernel…

copy kernel
__global__ void cpyToDeviceKernel(PointXY* __restrict__ ptRequestHost, PointXY* __restrict__ ptRequestDevice, long iChunkSize, int offset) {
    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        ptRequestDevice[i] = PointXY(ptRequestHost[i]); // code already changed to using a struct
    }
}

…performs a deep copy of the struct’s contents rather than a shallow copy.

Now, before I refactor all affected code, I would like you to briefly confirm ChatGPT’s information :-D

As I’m thinking about it… Instead of using a struct that contains one member I could just as well use arrays of type float2. This would mean a lot more refactoring though than just changing insides of the wrapper.

Not sure what that means. You can allocate for arrays of structs, or arrays of classes, just like you would with POD, and I’m not aware of any statements in the programming guide that contradict that.

I don’t see that you indicated what chatGPT said about that question.

That is incorrect syntax. We don’t mark class or struct definitions with __host__ __device__. I would have expected a compiler indication about that.

Beyond that, I can’t debug snippets of code. A short, complete example would be more useful for me in understanding where an invalid argument error may have come from.

Regarding the code that you claim is erroneous: by itself, it does not produce the error you suggest, that I can see, on CUDA 12.2, linux:

# cat t378.cu

class PointXY {
    // Properties
public:
    float2 coordinates;

    __host__ __device__ PointXY(float x, float y) {
        this->coordinates.x = x;
        this->coordinates.y = y;
    }

    __host__ __device__ PointXY(float2 coordinates) {
        this->coordinates.x = coordinates.x;
        this->coordinates.y = coordinates.y;
    }

    PointXY() {
        //
    }

    // Getters
    __host__ __device__ float getX() {
        return this->coordinates.x;
    }

    __host__ __device__ float getY() {
        return this->coordinates.y;
    }

    __host__ __device__ float2 getCoordinates() {
        return this->coordinates;
    }
};


int main(){
  int cPtRequest = 1048576;
  PointXY* ptRequest = nullptr;
  bool* bPtRequestIn = nullptr;

  cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY));
  cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool));
}
# nvcc -o t378 t378.cu
# compute-sanitizer ./t378
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
#

If I were to leave in the unnecessary __host__ __device__ decoration on the class definition, I get warnings like this:

# nvcc -o t378 t378.cu
t378.cu(2): warning #1866-D: attribute does not apply to any entity
  __attribute__((host)) __attribute__((device)) class PointXY {
                                       ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

t378.cu(2): warning #1866-D: attribute does not apply to any entity
  __attribute__((host)) __attribute__((device)) class PointXY {
                                       ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

I don’t know if you are getting warnings like that, but warnings could be a useful clue.


Hm. ChatGPT supposedly quotes from the Nvidia Programming Guide, chapter Memory Management:

“Non-trivial C++ types (with constructors, destructors, or virtual functions) are not safe to use with pinned memory or device memory allocations.”

So far, I can’t find this quote or where else it might come from. Meanwhile, a colleague sent me an article about how ChatGPT has been hallucinating ever more in the latest versions…

You’re right, sorry for that. ChatGPT answered that while float2 is a CUDA-specific type, it’s still handled as a simple type that gets copied by value. So yes, according to ChatGPT, copying elements of a struct containing a float2 using the = operator would perform a deep copy.
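If I got that right, this is the situation in a nutshell (the struct shown here is the planned replacement, the copyOne() function is just for illustration):

#include <type_traits>

struct PointXY {
    float2 coordinates;
};

// the type is trivially copyable, so '=' (and raw byte copies) simply duplicate the values
static_assert(std::is_trivially_copyable<PointXY>::value,
              "PointXY can be copied as plain bytes");

__device__ void copyOne(const PointXY* src, PointXY* dst, int i) {
    dst[i] = src[i];   // copies coordinates.x and coordinates.y by value
}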

I do get warnings about that at compile time, yes. But since I believed this was the correct way to make a class available in CPU and GPU code, and they were only warnings, I thought this wouldn’t be a problem. At least under Windows, I didn’t have any problems at run time with this syntax.

The warning reads as follows:

warning #1866-D: attribute does not apply to any entity
__declspec(__host__) __declspec(__device__) class PointXY {

I’m afraid I will only be able to provide a minimal working example on Monday, though.

[EDIT:]
When asked again for clarification, ChatGPT states that the quoted passage could indeed not be found literally in the CUDA Programming Guide. Instead, it retreats to community recommendations and such.

It even quotes you, @Robert_Crovella

In several threads on the official NVIDIA developer forums, including some by CUDA developers themselves, it is explicitly recommended:

  • to use only trivial C-like types (PODs) with cudaMallocHost().
  • not to use classes with constructors, destructors, inheritance etc., since cudaMallocHost() performs no initialization.

A frequently quoted point is:

“Pinned memory is raw memory — there is no placement-new or constructor invocation. Non-trivial types may not behave as expected.”

These statements come from, for example, Robert Crovella, a well-known NVIDIA CUDA engineer.

📄 Example: https://forums.developer.nvidia.com/t/using-std-vector-with-cudamallochost/203379

Bummer that the provided link doesn’t work…

That statement (specifically: a class with virtual functions) does require special attention. It is referred to here. That doesn’t apply to your code that I can see (right? You have depicted no classes with virtual functions.)

I also can’t open or find the provided link (203379)

It is correct that for classes with virtual functions, an ordinary allocator (used to provide storage space for objects of the class) may not properly initialize a class object. Non-virtual functions should not be an issue, because in that case the function call mechanism for object methods does not depend on the initialization of the object itself. But for virtual functions, it does depend on the object instance initialization.
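To illustrate with a hypothetical type (not from your code): for a class with virtual functions you would have to construct the object explicitly in the pinned storage, e.g. with placement-new, before calling any virtual methods:

#include <new>
#include <cuda_runtime.h>

struct WithVirtual {
    virtual float getX() const { return 0.0f; }
    virtual ~WithVirtual() {}
};

void example() {
    WithVirtual* p = nullptr;
    cudaMallocHost((void**)&p, sizeof(WithVirtual)); // raw storage, no constructor is run
    new (p) WithVirtual();                           // placement-new sets up the vtable pointer
    p->getX();                                       // now safe to call virtual functions
    p->~WithVirtual();                               // destroy before freeing the raw storage
    cudaFreeHost(p);
}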

None of this discussion applies to your case, that I can see. Furthermore, even if you violate some principle here, the expected outcome for me is not “invalid argument” on the call to cudaMallocHost.

And you have so far not provided an example code, that I can see or test, that actually results in that error.

You could test further by reducing the code to the cudaMallocHost call for just one of the types.
Then first compute both arguments as separate expressions, e.g.

PointXY** pptRequest = &ptRequest;
size_t cPointXY = sizeof(PointXY);
size_t ccPtRequest = cPtRequest * cPointXY;

Let us see which actual expression the error comes from.

I did some more digging under Ubuntu.

ChatGPT had also suggested a few other possible reasons for my issues, one of them being a limit on page-locked memory. I checked this and learned that my Ubuntu installation currently allows ~3.88 GiB of memory to be page-locked.

Max Page-locked memory limit in my Ubuntu installation

So, this does not seem to be the problem.

However, after some more testing, I noticed that my code does indeed work with my Ubuntu installation for test data of up to about 268,000,000 PointXY objects. At 269,000,000 objects, the code fails.

Testing session Ubuntu 24.04

I did not notice this earlier because my standard test case is using 300,000,000 PointXY objects, which does work under Windows.

So, the issue might somehow arise due to my GPU’s memory limits, although I don’t really get the math.

Mem space taken up in different scenarios

According to C++’s sizeof operator, one PointXY object uses 8 bytes of memory. This makes sense, seeing that it basically consists of two float variables.

So, my code allocates n * 8 bytes of memory for my PointXY objects plus n * 1 bytes for the results. This is assuming that CUDA stores each member of a bool array in 1 byte.

n = 268,000,000
n * 8 + n * 1 = approx. 2300.26 MiB
n = 269,000,000
n * 8 + n * 1 = approx. 2308.85 MiB

Neither scenario comes anywhere close to using up all of the memory available on my GPU.

Also, the screenshot of a console session of my code posted above shows that under Ubuntu, CUDA has 3781 MiB of GPU memory at its disposal, while on the same machine, Windows reports 4095 MiB available.
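A quick way to query those numbers from within the program (just a sketch):

size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);   // free and total device memory as seen by CUDA
printf("GPU memory: %.1f MiB free of %.1f MiB total\n",
       freeBytes / (1024.0 * 1024.0), totalBytes / (1024.0 * 1024.0));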

If you still need a working code example, let me know.

[EDIT:]

I did btw correct that. My PointXY class now looks as follows:

PointXY
class PointXY {
    // Properties
public:
    float2 coordinates;

    // Constructors
    __host__ __device__ PointXY(float x, float y) {
        this->coordinates.x = x;
        this->coordinates.y = y;
    }

    __host__ __device__ PointXY(float2 coordinates) {
        this->coordinates.x = coordinates.x;
        this->coordinates.y = coordinates.y;
    }

    PointXY() {
        //
    }

    // Getters
    __host__ __device__ float getX() {
        return this->coordinates.x;
    }

    __host__ __device__ float getY() {
        return this->coordinates.y;
    }

    __host__ __device__ float2 getCoordinates() {
        return this->coordinates;
    }
};

Perhaps there is a general OS-wide and a process-specific limit for page-locked memory?

Or is the “Max locked memory” in the /proc hierarchy already process-specific (compared to showing the OS-environment limit, under which the process runs)?

Sometimes it can be set by /etc/security/limits.conf: memlock …
Or try to actively change it with ulimit -l

Even if you are not using up all of it, try to increase the limit beyond 4 GiB to see if it has an effect.

This reply on Stack Exchange to a somewhat related question led me to the getrlimit(2) manual.
In the relevant section, I read the following:

getrlimit(2) manual

RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories. Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.

Especially the last bit makes me think that the displayed limit is process-specific rather than encompassing all processes started in the specific context.
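A small sketch of how the limit could be checked from inside the process itself, so we see exactly what the CUDA process runs under:

#include <sys/resource.h>
#include <cstdio>

int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("memlock soft limit: %llu bytes, hard limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return 0;
}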

I will try that and get back to you.

As they distinguish between privileged and unprivileged processes, sudo could be another avenue to try out.

Meanwhile, I’m back on this topic. I’m currently reading about Global and Shared Memory. The Mark Harris post about Global Memory states that:

For large strides, the effective bandwidth is poor regardless of architecture version. This should not be surprising: when concurrent threads simultaneously access memory addresses that are very far apart in physical memory, then there is no chance for the hardware to combine the accesses.

Isn’t that exactly what I’m doing at the moment with my custom copy kernels?

I'm allocating pinned memory on the host...
cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY));   // memory for the PointXY objects
cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool));   // memory for returning the results
...allocating global memory on the GPU...
cudaMalloc((void**)&ptRequestDev, cPtRequest * sizeof(PointXY));
cudaMalloc((void**)&bPtRequestInDev, cPtRequest * sizeof(bool));
...and then copying data H2D, using a stride loop
__global__ void cpyToDeviceKernel(PointXY* __restrict__ ptRequestHost, PointXY* __restrict__ ptRequestDevice, long iChunkSize, int offset) {
    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        ptRequestDevice[i] = PointXY(ptRequestHost[i].getCoordinates());
    }
}

Or is this issue nowadays not relevant anymore?

In this sentence he said “simultaneous access”, “concurrent threads”, and “addresses that are very far apart”.

Your stride is not for simultaneous access, but for the iterations of the for loop.

Concurrent threads (of a warp) access consecutive addresses.

His code:

template <typename T>
__global__ void stride(T* a, int s)
{
int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
a[i] = a[i] + 1;
}

Your code

__global__ void cpyToDeviceKernel(PointXY* __restrict__ ptRequestHost, PointXY* __restrict__ ptRequestDevice, long iChunkSize, int offset) {
    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        ptRequestDevice[i] = PointXY(ptRequestHost[i].getCoordinates());
    }
}

Especially see what happens to threadIdx.x and the multiplication with the stride size in his code.

Isn’t his code just another way of setting up a stride loop, though?
When I put the function you quoted and the calling bit next to each other, to me, this looks quite similar to what I’m doing:

code snippets from article
for (int i = 1; i <= 32; i++) {
    checkCuda( cudaMemset(d_a, 0.0, n * sizeof(T)) );

    checkCuda( cudaEventRecord(startEvent,0) );
    stride<<<n/blockSize, blockSize>>>(d_a, i);
    checkCuda( cudaEventRecord(stopEvent,0) );
    checkCuda( cudaEventSynchronize(stopEvent) );

    checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
    printf("%d, %f\n", i, 2*nMB/ms);
  }

[...]

template <typename T>
__global__ void stride(T* a, int s)
{
  int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
  a[i] = a[i] + 1;
}

And assuming that “concurrent threads” means the ones launched within the same thread block, the memory addresses accessed simultaneously by two threads in my code could theoretically be as far apart as iChunkSize minus a few, right?

This is the important location in the code, where threadIdx.x is multiplied by the stride creating gaps between neighbouring threads.

To be exact, the accesses of the 32 lanes or threads of a warp are combined into 32-byte sectors.
If you have a stride between neighbouring threads, you need more sectors.

However the following may work:

int tid = threadIdx.x; // 0..31
int idx = tid / 2; // index to an int array (32-bits)
if (tid & 1)
    idx += 512;

Now the accessed addresses of neighbouring threads jump back and forth, but altogether you have only 4 sectors: The ones starting at index 0, 8, 512, and 520. They span 0-15, 512-527.

There are some very minor performance improvements possible beyond the 32-byte rule (keeping 128-byte accesses together). And of course you want to use your caches, so overall you want repeated memory accesses to go to the same addresses, not just nearby ones.

Apart from that you can jump around the accesses, especially between different warps.

[In early Cuda architectures the rules for coalesced accesses were more strict.]
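To make the sector rule concrete, here are two toy kernels (names and the int element type are just for illustration): the first reads coalesced, the second with a stride between neighbouring threads:

__global__ void coalescedRead(const int* a, int* out) {
    // the 32 threads of a warp read a[0..31]: 128 consecutive bytes = 4 sectors
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    out[i] = a[i];
}

__global__ void stridedRead(const int* a, int* out, int s) {
    // the warp reads a[0], a[s], a[2*s], ...: up to 32 sectors once s >= 8 (for 4-byte ints)
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
    out[blockDim.x * blockIdx.x + threadIdx.x] = a[i];
}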

Okay, thank you for your insights. I can’t say that I fully understood the reasons why, but…

…these statements calm my worries in this respect.

Meanwhile, I started implementing a combined compute-and-write-back kernel. Well, of course it’s more of a refactoring of my former stride kernel.
I followed this blog post concerning shared memory and would like to put what I got so far up for discussion.

compute-and-write-back kernel

I marked the bits that are different from the latest working kernel via comments.
So far, my kernel does not know where to copy the results (&bPtRequestInHost as of now is unknown here). If the direction I’m taking here is okay, I will introduce this information to the kernel.

__global__ void combinedPtInPolyStoreBackKernel(PointXY* ptRequest, long iChunkSize, PointXY* ptPoly, long cPtPoly, int offset) {
    long polySize = cPtPoly;

    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float UNSCH = 0.01;

    extern __shared__ bool bPtRequestIn[];  // <-- NEW!

    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        float dAngle = 0;
        float dAngleTotal = 0;
        bool bContinueOuterLoop = false;

        for (int iPoint = 0; iPoint < polySize; iPoint++)
        {
            // fetch the current line segment
            PointXY ptCurr = ptPoly[iPoint];
            PointXY ptNext = iPoint == polySize - 1 ? ptPoly[0] : ptPoly[iPoint + 1];

            // if the point lies on the line
            if (gpu_PunktGleich(ptRequest[i], ptCurr, UNSCH) || gpu_PunktGleich(ptRequest[i], ptNext, UNSCH) || gpu_PtAufLinieP1P2Unscharf(ptRequest[i], ptCurr, ptNext, UNSCH))
            {
                bPtRequestIn[i] = true;
                bContinueOuterLoop = true;
                continue;
            }

            dAngle = gpu_winkel3Punkte(ptRequest[i], ptCurr, ptNext, UNSCH);
            dAngleTotal += dAngle;
        }

        if (bContinueOuterLoop) {
            continue;
        }

        if (abs(dAngleTotal) > 0.1)
        {
            bPtRequestIn[i] = true;
        }
        else {
            bPtRequestIn[i] = false;
        }
    }

    __syncthreads();    // <-- NEW!
    if (threadIdx.x == 0) {
        checkCuda(cudaMemcpyAsync(&bPtRequestInHost[offset], bPtRequestIn, iChunkSize * sizeof(bool), cudaMemcpyDeviceToHost)); // <-- NEW!
    }
}
calling CPU code

This, too, is mostly similar to the code launching my custom copy kernels version. I did have to massively shrink the chunkSize, though, due to the limits of available shared memory per block/SM. More on that later in this post.

 const int iNumStreams = 2;
 cudaStream_t streams[iNumStreams];
 int blockSize = 1024;
 int numBlocks = devProps.multiProcessorCount;
 const int blockSizeCpyH2D = 256;
 const int numBlocksCpyH2D = devProps.multiProcessorCount * 2;
 int iChunkSize = devProps.sharedMemPerMultiprocessor * 8;   // <-- NEW! shared memory needed for collecting the results; 8 booleans == 1 byte of shared memory

 for (int i = 0; i < iNumStreams; i++) {
     checkCuda(cudaStreamCreate(&streams[i]));
 }

 // Kernels ausführen
 long iLaunch = 0;
 bool bLaunchNext = true;

 while (bLaunchNext) {
     int offset = iLaunch * iChunkSize;
     int iCurrChunkSize = iChunkSize;
     int iCurrSharedMemSize = iCurrChunkSize / 8;    // <-- NEW! shared memory size in bytes
     int indexStream = iLaunch % iNumStreams;

     if (offset + iCurrChunkSize >= cPtRequest) {
         iCurrChunkSize = cPtRequest - offset;
         iCurrSharedMemSize = (int)roundUpToMultiple(iCurrChunkSize, 8); // <-- NEW! round up to whole bytes so that all result booleans fit
         bLaunchNext = false;
     }

     cpyToDeviceKernel << <numBlocksCpyH2D, blockSizeCpyH2D, 0, streams[indexStream] >> > (ptRequest, ptRequestDev, iCurrChunkSize, offset);
     combinedPtInPolyStoreBackKernel << <numBlocks, blockSize, iCurrSharedMemSize, streams[indexStream] >> > (ptRequestDev, iCurrChunkSize, ptPolyDev, cPtPoly, offset); // <-- NEW! 

     iLaunch++;
 }

[...]

Now, there’s a few things I’d like to clarify before testing this.

1.) I’m assuming I won’t have to manage shared memory myself with regard to releasing it after use, right? Will the memory be available long enough for cudaMemcpyAsync() to finish?

2.) The way I’m doing it now, I’m obviously not minding any warp groupings of threads. Instead, I’m collecting all the results from one block and transferring them back to the host in one go. Is there a downside to this compared to doing it warp-wise, as you suggested?

3.) As I mentioned, I’m bound to the limits of available shared memory here. In fact, the way the CPU code above is set up, I’m reserving the maximum of available shared memory for each block of threads. Still, I had to heavily decrease the used chunkSize: It went down from 5,000,000 to 102,400 (sharedMemPerSM in bytes) * 8 = 819,200. When launching 20 blocks (one for each SM) with 1024 threads each, this means that each thread will only calculate 40 PointXY objects. Will this negatively affect the kernel’s performance? Or is there another, better way of using shared memory?

I hope you guys can shed some light on this.