Can someone help me understand this access pattern?

So I’m trying to read some source code and I’ve never seen access patterns like this in CUDA.

Here’s the code :

#ifdef REAL_TYPE_FP32
typedef float RealType;
#else
typedef double RealType;
#endif

////////////////////////////////////////////////////////////////////// Points
struct Point3
{
    RealType _p[ 3 ];

    bool lessThan( const Point3& pt ) const
    {
        if ( _p[0] < pt._p[0] ) return true; 
        if ( _p[0] > pt._p[0] ) return false; 
        if ( _p[1] < pt._p[1] ) return true; 
        if ( _p[1] > pt._p[1] ) return false; 
        if ( _p[2] < pt._p[2] ) return true; 

        return false; 
    }

    bool operator < ( const Point3& pt ) const
    {
        return lessThan( pt );
    }

    void printOut() const
    {
        std::cout << _p[0] << " " << _p[1] << " " << _p[2] << std::endl;
    }
};

/////////////////////////////////////////////// Some helper functions
#ifdef REAL_TYPE_FP32
struct __align__(8) RealType2 
{
    float _c0, _c1; 
};
#else
struct __align__(16) RealType2 
{
    double _c0, _c1; 
};
#endif
__forceinline__ __device__ 
Point3 loadPoint3( const Point3 *pointArr, int idx ) 
{
    if ( idx % 2 == 0 ) {
        const RealType2 ptxy = ( ( RealType2 *) pointArr )[ idx * 3 / 2 ]; 
        const Point3 pt = {
            ptxy._c0,
            ptxy._c1,
            pointArr[ idx ]._p[2] 
        }; 

        return pt; 
    } 
    else { 
        const RealType2 ptyz = ( ( RealType2 *) pointArr )[ idx * 3 / 2 + 1 ]; 
        const Point3 pt = {
            pointArr[ idx ]._p[0],
            ptyz._c0,
            ptyz._c1 
        }; 
        
        return pt; 
    }
}

Obviously, pointArr is an array of the point structures. I really don’t understand how this works. Namely, the 3/2 part.

Assuming we have a 1d array of floats where x,y,z are stored contiguously so we have storage like :

x0 y0 z0 x1 y1 z1 ...

,
I think we have the 3 because there’s 3 coordinates per structure. We want efficient loading from global memory so we break up the reads into a float2 (or double2) load and then a float (or double) load.

To translate the offset of a point in the buffer, it’s the point id * 3 (makes sense) but why are we dividing by 2? And why the even/odd stuff? Does anyone have any ideas? If there’s any ambiguity in the code, just ask and I can post more of the source.

Edit : Okay, looking at the code again, it seems like if the id of the point requested is even, we load the xy coordinates through a vectorized load otherwise if it’s odd we load the yz component. Why would we want it to be like that? Maybe I’m wrong about how the internal storage works. This seems like a really confusing way of doing this O_o

Is it because we’d have weird access patterns if we didn’t store it this way in global memory?

The size of a Point3 structure is 3/2 the size of a RealType2. If I have a idx that is originally referenced against a Point3 pointer, and then I “reinterpret” that idx to be referenced against a RealType2 pointer, and I want to retrieve the same value (that the Point3 idx was originally referring to), I must multiply the idx value by 3/2. If the original idx value is even, this just works. If it is odd, I must do some fixup.

Suppose I have:

|| x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3

the base type index:

||  0  1  2  3  4  5  6  7  8  9 10 11

the Point3 index

||  0        1        2        3

the RealType2 index:

||  0     1     2     3     4     5

Now suppose I am given the pointer (pointArr) pointing to the start of the sequence, and I want to retrieve the elements starting with x2. The Point3 index is 2. What should my RealType2 index be?

Thank you so much for illustrating the problem like that! It makes so much more sense now!

I get it now.

Isn’t this some clever code though? I would’ve never even thought to do it this way. I think the author is pretty decent at coding. I have a lot to learn.