__align__ structure

After reading “Loading Structured Data Efficiently With CUDA” I wanted to use structure alignment in a program of mine.
Currently, I have a structure defined as
typedef struct {
float a, b, c;
} anobject;

But I can’t align it to 12 bytes.
I’ve also looked at vector_types.h, and the float3 structure is not aligned either.
Will this affect memory transfer performance?
Should I just use float3 instead of my object?
Should I pad the object so that it aligns to 16 bytes?

What would you suggest?

Yes, float3 cannot be read as quickly as float4. However, there is a standard trick to cast your float3 pointer to a float pointer, and then read 3x the number of floats into a shared array (also float3* cast to float*). This is the best option.
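
The indexing behind this trick can be sketched on the host. This is not a CUDA kernel, only a plain C++ emulation of which float each thread of one block would copy; the names `Object12`, `stage_block` and `object_for_thread` are made up for illustration, and in a real kernel the buffer would be a `__shared__` array followed by `__syncthreads()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same 12-byte layout as the struct in the question (hypothetical name).
struct Object12 { float a, b, c; };

// Emulates one block of `block_dim` threads staging `block_dim` objects.
// In pass k, thread t copies float number k * block_dim + t, so
// neighbouring threads always touch neighbouring 4-byte words.
std::vector<float> stage_block(const Object12* objects, std::size_t block_dim) {
    assert(objects != nullptr);
    // View the array of structs as a flat array of floats.
    const float* in = reinterpret_cast<const float*>(objects);
    std::vector<float> shared(3 * block_dim);
    for (std::size_t k = 0; k < 3; ++k)              // three passes
        for (std::size_t t = 0; t < block_dim; ++t)  // "threads" of the block
            shared[k * block_dim + t] = in[k * block_dim + t];
    return shared;
}

// After staging, thread t finds its own object at floats 3t, 3t+1, 3t+2.
Object12 object_for_thread(const std::vector<float>& shared, std::size_t t) {
    return Object12{shared[3 * t], shared[3 * t + 1], shared[3 * t + 2]};
}
```

The point of the two-step copy is that the global reads are plain 32-bit accesses with consecutive addresses, while the stride-3 access happens only in the staging buffer.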

If this is not possible for your program, then you will get better memory performance reading float4 variables. (Note that you will get even better performance with float2 variables. Coalesced reads of 128-bit values have a lower bandwidth than 64 or 32-bit values, for some reason.)
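
To make the size trade-off concrete, here is a small host-side sketch comparing the 12-byte layout with a padded 16-byte one. It assumes a 4-byte `float` and a C++11 compiler; `Object12`, `Object16`, `bytes12` and `bytes16` are hypothetical names, and `alignas(16)` stands in on the host for what `__align__(16)` would do under nvcc.

```cpp
#include <cstddef>

// The layout from the question: three floats, aligned like a single float.
struct Object12 {
    float a, b, c;
};

// Hypothetical padded variant: one unused float brings the size to 16 bytes,
// and alignas(16) makes every array element start on a 16-byte boundary.
struct alignas(16) Object16 {
    float a, b, c;
    float pad;  // unused filler
};

static_assert(sizeof(Object12) == 12, "three packed floats");
static_assert(alignof(Object12) == 4, "only float alignment");
static_assert(sizeof(Object16) == 16, "padded to 16 bytes");
static_assert(alignof(Object16) == 16, "16-byte aligned");

// Bytes needed to store n elements of each layout.
constexpr std::size_t bytes12(std::size_t n) { return n * sizeof(Object12); }
constexpr std::size_t bytes16(std::size_t n) { return n * sizeof(Object16); }
```

With these sizes, three padded objects occupy as much memory as four unpadded ones, so padding costs a third more storage.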

Well, my program uses a lot of memory, so padding each object from 12 to 16 bytes (a third more memory) might be a problem.
However, the application (a type of neural network) has to load hundreds of thousands of objects like this in each iteration (each one represents a weight, with x, y and val values), and its main problem is the memory bandwidth bottleneck.
I am trying to coalesce reads as much as I can.
I’ll try to pack the whole structure as

typedef struct {
unsigned short x;
unsigned short y;
float val;
} …;

And pack it to 8 bytes.
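
A quick host-side check of this layout might look as follows. It is only a sketch: the struct name `PackedWeight` and the helper `fits_in_packed` are hypothetical (the thread leaves the name elided), and it assumes a 4-byte `float` and a C++11 compiler.

```cpp
#include <cstdint>

// Two 16-bit coordinates followed by a float: no padding is needed,
// so the struct is 8 bytes on common platforms.
struct PackedWeight {
    std::uint16_t x;  // valid range 0..65535
    std::uint16_t y;  // valid range 0..65535
    float val;
};

static_assert(sizeof(PackedWeight) == 8, "expected an 8-byte weight");

// True when both coordinates can be stored in the 16-bit fields.
constexpr bool fits_in_packed(unsigned long x, unsigned long y) {
    return x <= 0xFFFFul && y <= 0xFFFFul;
}
```

The price of the smaller struct is the coordinate range: any x or y above 65535 no longer fits.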

Thank you for your answer.

Sorry, but I have another question.

Why does the following code
typedef struct __align__(8) {
unsigned short x, y;
float val;
} …;

give me the error “expected unqualified-id before numeric constant”?

This declaration is in a header file, which is included by both a .cu file and a .cpp file.
The error arises when compiling with g++, and it looks like it is related to the __align__ keyword, which g++ does not know outside of nvcc.


#ifdef __CUDACC__
typedef struct __align__(8) {
#else
typedef struct {
#endif
unsigned short x, y;
float val;
} …;
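
An alternative to switching the whole declaration on `__CUDACC__` is to hide the attribute behind a macro, so that the host compiler applies the same 8-byte alignment as nvcc. With the plain `#ifdef` variant the size is the same on both sides, but the host struct is only 4-byte aligned, which can matter if it is nested inside another struct. The following is a sketch only: `WEIGHT_ALIGN` and `Weight` are made-up names, and the `static_assert` lines assume C++11.

```cpp
// Pick the alignment spelling that the current compiler understands.
#if defined(__CUDACC__)
#  define WEIGHT_ALIGN(n) __align__(n)                 // provided by nvcc
#elif defined(__GNUC__)
#  define WEIGHT_ALIGN(n) __attribute__((aligned(n)))  // g++ and clang
#elif defined(_MSC_VER)
#  define WEIGHT_ALIGN(n) __declspec(align(n))         // MSVC
#else
#  error "add an alignment attribute for this compiler"
#endif

// Hypothetical name; the same declaration is seen by .cu and .cpp files.
struct WEIGHT_ALIGN(8) Weight {
    unsigned short x, y;
    float val;
};

static_assert(sizeof(Weight) == 8, "weight should occupy 8 bytes");
static_assert(alignof(Weight) == 8, "weight should be 8-byte aligned");
```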