what makes built-in vectors special?

isaaclee2313 · May 24, 2019, 1:15am

inside “cuda/include/vector_types.h”, built-in vectors are declared with one or two of:

__device_builtin__
__builtin_align__(n)
__align__(n)

What does each of the three do? Can we declare custom built-in vectors like the ones defined in “vector_types.h”?

njuffa · May 24, 2019, 2:18am

By definition, custom types cannot be built-in. You could certainly apply the __align__() attribute to custom types, but an even better way is probably to use the C++11 alignas specifier, which should be portable across the host and device portions of your code.

I have not tried that, though. Check the CUDA Best Practices Guide for any relevant advice on alignment.
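A minimal sketch of the alignas suggestion, written as plain C++11 so it compiles on the host side. The name four_floats_aligned is made up for illustration, and the checks assume a platform where float is 4 bytes. This only confirms the alignment the type ends up with; it says nothing about which instructions nvcc generates.

```cpp
#include <cstddef>

// Plain struct: its alignment follows the widest member (a float).
struct four_floats {
    float a, b, c, d;
};

// Hypothetical name for illustration: same members, but over-aligned
// to 16 bytes with the C++11 alignas specifier.
struct alignas(16) four_floats_aligned {
    float a, b, c, d;
};

// Compile-time checks (assume 4-byte float).
static_assert(alignof(four_floats) == alignof(float),
              "plain struct has the alignment of a float");
static_assert(alignof(four_floats_aligned) == 16,
              "alignas(16) raises the alignment to 16 bytes");
static_assert(sizeof(four_floats_aligned) == sizeof(four_floats),
              "no padding is added, the size stays the same");
```

In CUDA code the __align__(16) spelling from the headers would be used in the same position on the struct declaration.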

isaaclee2313 · May 24, 2019, 4:56am

Thanks!

Do you happen to know what the three of them do? There is no info in the guide.

njuffa · May 24, 2019, 7:11am

While there are no recommendations in the Best Practices Guide, __align__() is documented from a functional perspective in the CUDA Programming Guide. For alignas you would want to look at the C++11 specification, or maybe a site like CPPreference: https://en.cppreference.com/w/cpp/language/alignas

Anything with “builtin” typically refers to compiler internal mechanisms (compare for example the various “builtin” items in gcc). To my recollection, __device_builtin__ has nothing to do with alignment, but is a function (rather than a data) attribute. I am not familiar with the compiler internals.

Your questions appear indicative of an XY problem. What are you actually trying to do here, and what issues have you encountered?

isaaclee2313 · May 24, 2019, 7:15am

I compared the SASS code and throughput of using the built-in float4 and custom structure four_floats:

struct four_floats{
  float a;
  float b;
  float c;
  float d;
};

When loading a single four_floats item from global memory:

four_floats four_floats_reg = four_floats_global[0];

Although this can theoretically be serviced in a single 16-byte global load, it is broken up into four 4-byte loads in the SASS code:

LDG R30, [R2]
...
LDG R26, [R4]
...
LDG R30, [R2+0x4]
...
LDG R26, [R4+0x4]

Which makes the loads un-coalesced.

However, when loading a single float4 item from global memory:

float4 float4_reg = float4_global[0];

The entire 16-byte load is serviced in one 16-byte load transaction:

LDG.128 R12, [R12]

so I wanted to know what is causing this difference in SASS generation.

njuffa · May 24, 2019, 7:25am

As the CUDA documentation points out, all data accesses on the GPU need to be naturally aligned: an N-byte quantity needs to be aligned to an N-byte boundary. If you have a simple struct of four ‘float’ components, the struct will be aligned to a 4-byte boundary (as the largest element is a 4-byte ‘float’), and so wide vector loads cannot be used. If you add an alignment attribute to your struct that specifies 16-byte alignment, the compiler is able to generate a 16-byte load instruction.
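To make the natural-alignment argument concrete, here is a small sketch in host-compilable C++. The helpers is_naturally_aligned and is_valid_address, the type four_floats_aligned, and the kernel copy_first are all hypothetical names introduced for illustration, and the example addresses are arbitrary. It shows that an address such as 0x1004 is a perfectly legal location for the 4-byte-aligned struct yet is not on a 16-byte boundary, so the compiler cannot assume a 16-byte load is safe there. The kernel at the end is only compiled under nvcc; the resulting SASS has not been checked here.

```cpp
#include <cstddef>
#include <cstdint>

struct four_floats { float a, b, c, d; };                       // 4-byte aligned
struct alignas(16) four_floats_aligned { float a, b, c, d; };   // 16-byte aligned

// True if addr lies on an n-byte boundary (the natural-alignment rule
// for an n-byte access).
constexpr bool is_naturally_aligned(std::uintptr_t addr, std::size_t n) {
    return addr % n == 0;
}

// True if an object of type T may legally live at addr.
template <typename T>
constexpr bool is_valid_address(std::uintptr_t addr) {
    return addr % alignof(T) == 0;
}

// 0x1004 is a legal home for four_floats but not a 16-byte boundary,
// so a 16-byte load from it would violate natural alignment.
static_assert(is_valid_address<four_floats>(0x1004), "legal for the plain struct");
static_assert(!is_naturally_aligned(0x1004, 16), "but not 16-byte aligned");

// With the alignment attribute, such an address is ruled out up front:
// every legal address of the type is also a 16-byte boundary.
static_assert(!is_valid_address<four_floats_aligned>(0x1004), "not legal any more");
static_assert(is_valid_address<four_floats_aligned>(0x1010), "legal and 16-byte aligned");

#ifdef __CUDACC__
// Device-side counterpart of the load in question (only built by nvcc).
__global__ void copy_first(const four_floats_aligned* in, four_floats_aligned* out) {
    out[0] = in[0];
}
#endif
```

Comparing the SASS of copy_first against the same kernel written for the plain four_floats (for example with cuobjdump -sass) is one way to confirm the difference on a given toolkit version.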

I would suggest spending some quality time with the CUDA documentation; it answers a surprisingly large number of questions that programmers might have.
