Question on CUDA built-in vector types

On Windows 10, there’s a struct definition in <vector_types.h> like the following:

struct __device_builtin__ __builtin_align__(16) float4
{
float x, y, z, w;
};

My question is, if I create my own vector type like below and use it on device:

struct __align__(16) my_float4
{
float x, y, z, w;
};

does it yield the same performance as the built-in vector type?

If not, what makes the built-in vector types special?
How can I define custom vector types whose performance matches the built-in ones?

I don’t think there should be any difference. Why not test it?

I concur with Robert Crovella’s assessment. As I recall, the alignment attributes in CUDA source files are translated into the equivalent host-compiler specific attributes before the code is sent to the host compiler. You should be able to verify this by keeping and examining the intermediate files created during compilation of a .cu file.
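As a quick sanity check (a minimal sketch, assuming a C++11 host compiler; my_float4 is the custom type from the question), one can also verify that the translated alignment ends up identical to the built-in type’s:

#include <cuda_runtime.h>
#include <cstdio>

struct __align__(16) my_float4
{
	float x, y, z, w;
};

int main()
{
	// If __align__(16) is translated into the equivalent host-compiler
	// attribute, both types must report identical size and alignment.
	static_assert(sizeof(my_float4) == sizeof(float4), "size mismatch");
	static_assert(alignof(my_float4) == alignof(float4), "alignment mismatch");

	printf("sizeof = %zu, alignof = %zu\n", sizeof(float4), alignof(float4));
	return 0;
}

This should print 16 for both.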

I tested with a small program:

#include <iostream>
#include <chrono>
#include <cstdlib>
#include <cuda_runtime.h>

using namespace std;

// Element-wise add over arrays of a vector-like type T (needs members x, y, z, w).
template <typename T>
__global__ void _add(int N, const T* a, const T* b, T *c)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;

	if (i >= N) return;

	c[i].x = a[i].x + b[i].x;
	c[i].y = a[i].y + b[i].y;
	c[i].z = a[i].z + b[i].z;
	c[i].w = a[i].w + b[i].w;
}

template <typename T>
void _test(int N, int iter)
{
	using namespace chrono;

	T *a, *b, *c;

	// Device buffers; contents are left uninitialized since only the
	// kernel execution time is measured here.
	cudaMalloc(&a, N * sizeof(T));
	cudaMalloc(&b, N * sizeof(T));
	cudaMalloc(&c, N * sizeof(T));

	dim3 blocks, threads;

	threads = { 1024, 1, 1 };
	blocks = { (N + threads.x - 1) / threads.x, 1, 1 };	// round up so all N elements are covered

	auto init = steady_clock::now();

	for (int i = 0; i < iter; ++i) {
		_add <<< blocks, threads >>> (N, a, b, c);
	}

	cudaDeviceSynchronize();	// kernel launches are asynchronous; wait for all of them

	auto elapsed = duration_cast<milliseconds>(steady_clock::now() - init);

	cout << static_cast<double>(elapsed.count()) / 1000 << " seconds" << endl;

	cudaFree(a);
	cudaFree(b);
	cudaFree(c);
}

struct __align__(16) my_float4
{
	float x, y, z, w;
};

int main(int argc, char *argv[])
{
	int N = 1000000;
	int iter = 20;

	if (argc >= 2) N = atoi(argv[1]);
	if (argc >= 3) iter = atoi(argv[2]);
	if (argc >= 4) {
		cout << "Usage: " << argv[0] << " [N] [iter]" << endl;
		exit(EXIT_FAILURE);
	}
	
	cout << "N = " << N << endl;
	cout << "ITER = " << iter << endl;

	cout << "Built-in float4" << endl;
	_test<float4>(N, iter);

	cout << "Custom float4" << endl;
	_test<my_float4>(N, iter);
}

The test was run on Linux (Ubuntu 16.04 LTS) with an RTX 2080 Ti and CUDA 11.0.
The results show no difference between the custom and the built-in type:

N = 100000000
ITER = 1000
Built-in float4
8.809 seconds
Custom float4
8.811 seconds
N = 100000000
ITER = 10000
Built-in float4
88.07 seconds
Custom float4
87.958 seconds
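
The timing parity is consistent with what the 16-byte alignment buys at the machine-code level: for either type, the compiler can service each element with single 128-bit loads and stores. A minimal kernel pair for inspecting this (an illustrative sketch; disassembling the binary, e.g. with cuobjdump, should show the same vectorized 128-bit load/store instructions for both):

__global__ void copy_builtin(int N, const float4 *in, float4 *out)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < N) out[i] = in[i];	// one 128-bit load + one 128-bit store
}

__global__ void copy_custom(int N, const my_float4 *in, my_float4 *out)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < N) out[i] = in[i];	// same generated code, thanks to __align__(16)
}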

One more question: what is the role of the __device_builtin__ attribute?

If you cannot find it described in the CUDA documentation: “builtin” attributes in various tool chains typically refer to features the compiler “knows” about, and thus handles in particular ways internally. So here it presumably means that float4 is not just some user-defined type, but a built-in type that is treated much the same as float or double.
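
Worth adding: for ordinary device code like the kernels in this thread, a user-defined type can replicate what the built-in vector types offer, since the conveniences mostly live in the headers. For instance, make_float4 is an ordinary helper declared in vector_functions.h, and the equivalent for the custom type is trivial to write (an illustrative sketch; make_my_float4 is a made-up name):

__host__ __device__ inline my_float4 make_my_float4(float x, float y, float z, float w)
{
	// Mirrors make_float4 from vector_functions.h, just for the custom type.
	my_float4 v;
	v.x = x; v.y = y; v.z = z; v.w = w;
	return v;
}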