Possibilities to further optimize PoC programme using custom copy kernels

I did some more digging under Ubuntu.

ChatGPT had also suggested a few other possible causes for my issues, one of them being a limit on page-locked memory. I checked this and found that my Ubuntu installation currently allows ~3.88 GiB of memory to be page-locked.

[Screenshot: Max page-locked memory limit in my Ubuntu installation]

So, this does not seem to be the problem.
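For anyone who wants to reproduce that check from inside the program rather than from the shell, here is a minimal sketch. It assumes the limit in question is the per-process RLIMIT_MEMLOCK resource limit (the value `ulimit -l` reports). The helper names queryMemlockLimit and printLimit are made up for illustration. Whether the CUDA driver's pinned allocations are actually bound by this limit is an assumption I have not verified.

```cpp
#include <sys/resource.h>  // getrlimit, RLIMIT_MEMLOCK, RLIM_INFINITY (POSIX)
#include <cstdio>

// Hypothetical helper: reads the soft and hard RLIMIT_MEMLOCK values in bytes.
// Returns false if the query itself fails.
bool queryMemlockLimit(rlim_t &softBytes, rlim_t &hardBytes) {
    struct rlimit limit {};
    if (getrlimit(RLIMIT_MEMLOCK, &limit) != 0) {
        return false;
    }
    softBytes = limit.rlim_cur;
    hardBytes = limit.rlim_max;
    return true;
}

// Hypothetical helper: prints one limit, treating RLIM_INFINITY as "unlimited".
void printLimit(const char *label, rlim_t bytes) {
    if (bytes == RLIM_INFINITY) {
        std::printf("%s: unlimited\n", label);
    } else {
        std::printf("%s: %.2f GiB\n", label,
                    static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0));
    }
}
```

Calling queryMemlockLimit and then printLimit for both values at program start would show the limits that apply to the process itself, which can differ from those of an interactive shell.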

However, after some more testing, I noticed that my code does indeed work with my Ubuntu installation for test data of up to about 268,000,000 PointXY objects. At 269,000,000 objects, the code fails.

[Screenshot: Testing session Ubuntu 24.04]

I did not notice this earlier because my standard test case uses 300,000,000 PointXY objects, which does work under Windows.

So, the issue might somehow arise due to my GPU’s memory limits, although I don’t really get the math.

Mem space taken up in different scenarios

According to C++'s sizeof operator, one PointXY object occupies 8 bytes of memory. This makes sense, seeing that it basically consists of 2 float variables.

So, my code allocates n * 8 bytes of memory for my PointXY objects plus n * 1 bytes for the results. This is assuming that CUDA stores each member of a bool array in 1 byte.
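The two size assumptions can be pinned down at compile time. This is only a sketch: Float2Standin and PointXYLayout are hypothetical stand-ins so that it builds without the CUDA headers. In the real code the same static_assert lines would be applied to PointXY itself. Note that CUDA's float2 is 8-byte aligned, which the stand-in is not, but the size is the same.

```cpp
#include <cstddef>

// Stand-in for CUDA's float2 (assumption: two packed floats, 8 bytes total).
struct Float2Standin {
    float x;
    float y;
};

// Hypothetical mirror of the PointXY data layout: one float2 member, no virtuals.
struct PointXYLayout {
    Float2Standin coordinates;
};

// Compilation fails if either size assumption from the text does not hold.
static_assert(sizeof(PointXYLayout) == 8, "expected 8 bytes per point");
static_assert(sizeof(bool) == 1, "expected 1 byte per result flag");

// Bytes needed per input element: one point plus one bool result.
constexpr std::size_t bytesPerElement() {
    return sizeof(PointXYLayout) + sizeof(bool);
}
```

If the compiler used for the device code ever disagreed with these sizes, the build would stop here instead of the calculation silently being off.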

n = 268,000,000
n * 8 + n * 1 = approx. 2300.26 MiB
n = 269,000,000
n * 8 + n * 1 = approx. 2308.85 MiB

Neither scenario comes close to using all of the memory available on my GPU.
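The arithmetic above can be written as a small host-side sketch, with hypothetical helper names, using 8 bytes per point and 1 byte per result as stated. It also records one purely arithmetic observation: the PointXY buffer on its own (n * 8 bytes) crosses 2 GiB, i.e. 2^31 bytes, somewhere between the two test sizes. That is not a diagnosis, only something the numbers show.

```cpp
#include <cstdint>

constexpr std::uint64_t kBytesPerPoint  = 8;  // sizeof(PointXY) as reported
constexpr std::uint64_t kBytesPerResult = 1;  // assumed size of one bool
constexpr std::uint64_t kTwoGiB = 1ull << 31; // 2,147,483,648 bytes

// Size of the PointXY buffer alone.
constexpr std::uint64_t pointBytes(std::uint64_t n) {
    return n * kBytesPerPoint;
}

// Size of the PointXY buffer plus the bool result buffer.
constexpr std::uint64_t totalBytes(std::uint64_t n) {
    return n * (kBytesPerPoint + kBytesPerResult);
}

// Converts a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// totalBytes(268000000) = 2,412,000,000 bytes, approx. 2300.26 MiB
// totalBytes(269000000) = 2,421,000,000 bytes, approx. 2308.85 MiB

// Observation only: the point buffer passes 2^31 bytes between the two sizes.
static_assert(pointBytes(268000000ull) < kTwoGiB, "268M points stay below 2 GiB");
static_assert(pointBytes(269000000ull) > kTwoGiB, "269M points exceed 2 GiB");
```

Exactly 2^28 = 268,435,456 points of 8 bytes each make up 2 GiB. It might be worth testing sizes just below and just above that count to see whether the failure lines up with it.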

Also, the screenshot of my console session posted above shows that under Ubuntu, CUDA has 3781 MiB of GPU memory at its disposal, while on the same machine Windows reports 4095 MiB as available.

If you still need a working code example, let me know.

[EDIT:]

I did btw correct that. My PointXY class now looks as follows:

PointXY
```cpp
class PointXY {
    // Properties
public:
    float2 coordinates;

    // Constructors
    __host__ __device__ PointXY(float x, float y) {
        this->coordinates.x = x;
        this->coordinates.y = y;
    }

    __host__ __device__ PointXY(float2 coordinates) {
        this->coordinates.x = coordinates.x;
        this->coordinates.y = coordinates.y;
    }

    PointXY() {
        //
    }

    // Getters
    __host__ __device__ float getX() {
        return this->coordinates.x;
    }

    __host__ __device__ float getY() {
        return this->coordinates.y;
    }

    __host__ __device__ float2 getCoordinates() {
        return this->coordinates;
    }
};
```