What makes built-in vectors special?

Inside “cuda/include/vector_types.h”, built-in vectors are declared with one or two of:

__device_builtin__
__builtin_align__(n)
__align__(n)

What does each of the three do? Can we declare custom built-in vectors like the ones defined in “vector_types.h”?

By definition, custom types cannot be built-in. You could certainly apply the __align__() attribute to custom types, but an even better way is probably to use the C++11 alignas specifier, which should be portable across the host and device portions of your code.

I have not tried that, though. Check the CUDA Best Practices Guide for any relevant advice on alignment.
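A minimal sketch of the alignas approach mentioned above, written as plain host-side C++11. The type name four_floats_aligned is hypothetical, and the sketch only checks the size and alignment the compiler assigns; it does not show what SASS nvcc generates for it. In CUDA code, __align__(16) is the attribute documented in the CUDA Programming Guide for the same purpose.

```cpp
// Hypothetical custom type: four floats with an explicit 16-byte alignment
// requirement, requested through the standard C++11 alignas specifier.
struct alignas(16) four_floats_aligned {
    float a;
    float b;
    float c;
    float d;
};

// Compile-time checks: the struct still occupies 16 bytes, and every object
// of this type must now start on a 16-byte boundary.
static_assert(sizeof(four_floats_aligned) == 16, "four floats occupy 16 bytes");
static_assert(alignof(four_floats_aligned) == 16, "alignas(16) sets the alignment to 16");
```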

Thanks!

Do you happen to know what the three of them do? There is no info on the guide.

While there are no recommendations in the Best Practices Guide, __align__() is documented from a functional perspective in the CUDA Programming Guide. For alignas you would want to look at the C++11 specification, or maybe a site like CPPreference: https://en.cppreference.com/w/cpp/language/alignas

Anything with “builtin” typically refers to compiler internal mechanisms (compare for example the various “builtin” items in gcc). To my recollection, __device_builtin__ has nothing to do with alignment, but is a function (rather than a data) attribute. I am not familiar with the compiler internals.

Your questions appear indicative of an XY problem. What are you actually trying to do here, and what issues have you encountered?

I compared the SASS code and throughput of using the built-in float4 and custom structure four_floats:

struct four_floats{
  float a;
  float b;
  float c;
  float d;
};

When loading a single four_floats item from global memory:

four_floats four_floats_reg = four_floats_global[0];

Although this could in theory be serviced by a single 16-byte global load, it is broken up into four 4-byte loads in the SASS code:

LDG R30, [R2]			
...
LDG R26, [R4]
...
LDG R30, [R2+0x4]						
...
LDG R26, [R4+0x4]

This makes the loads un-coalesced.

However, when loading a single float4 item from global memory:

float4 float4_reg = float4_global[0];

The entire 16-byte load is serviced in one 16-byte load transaction:

LDG.128 R12, [R12]

So I wanted to know what is causing this difference in SASS generation.

As the CUDA documentation points out, all data accesses on the GPU need to be naturally aligned: an N-byte quantity needs to be aligned to an N-byte boundary. If you have a simple struct of four ‘float’ components, the struct will be aligned to a 4-byte boundary (as the largest element is a 4-byte ‘float’), and so wide vector loads cannot be used. If you add an alignment attribute to your struct that specifies 16-byte alignment, the compiler is able to generate a 16-byte load instruction.
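The natural-alignment rule can be illustrated with a short host-side C++11 sketch. The helper is_naturally_aligned and the address 0x1004 are made up for illustration; the point is only that a plain four_floats may legally sit at an address that is a multiple of 4 but not of 16, so the compiler cannot assume a 16-byte load is safe.

```cpp
#include <cstddef>
#include <cstdint>

// Plain struct from the question: its alignment is that of its largest
// member, a 4-byte float.
struct four_floats {
    float a;
    float b;
    float c;
    float d;
};

// Hypothetical helper: an access of `width` bytes is naturally aligned when
// the address is a multiple of `width`.
constexpr bool is_naturally_aligned(std::uintptr_t address, std::size_t width) {
    return address % width == 0;
}

// The struct is 16 bytes wide but only required to be 4-byte aligned.
static_assert(sizeof(four_floats) == 16, "four floats occupy 16 bytes");
static_assert(alignof(four_floats) == 4, "alignment follows the float members");

// 0x1004 is a valid location for a four_floats object (multiple of 4) ...
static_assert(is_naturally_aligned(0x1004, alignof(four_floats)), "valid 4-byte boundary");
// ... but a single 16-byte load from that address would not be naturally aligned.
static_assert(!is_naturally_aligned(0x1004, 16), "not a 16-byte boundary");
```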

I would suggest spending some quality time with the CUDA documentation; it answers a surprisingly large number of questions that programmers might have.