CUDA memory alignment performance and wrong output results

I am trying to understand GPU memory alignment by working through the official CUDA samples released by NVIDIA. The demo code uses struct types such as the following three:

typedef struct {
    unsigned char r, g;
} RGBA8_misaligned;

typedef struct __align__(16) {
    unsigned int r, g, b, a;
} RGBA32;

typedef struct __align__(16) {
    RGBA32 c1, c2;
} RGBA32_2;
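To make the layouts concrete, here is a minimal sketch that checks the size and alignment of the three types at compile time. It assumes a 4-byte unsigned int. The ALIGN macro and the elementsIn helper are hypothetical names introduced for illustration, and the structs are written in named form instead of the typedef form above. When the file is not compiled by nvcc, the standard alignas(16) stands in for __align__(16).

```cpp
#include <cassert>
#include <cstddef>

// ALIGN is a hypothetical helper macro: __align__ under nvcc,
// the standard alignas when compiled as plain host C++.
#ifdef __CUDACC__
#define ALIGN(n) __align__(n)
#else
#define ALIGN(n) alignas(n)
#endif

struct RGBA8_misaligned { unsigned char r, g; };
struct ALIGN(16) RGBA32 { unsigned int r, g, b, a; };
struct ALIGN(16) RGBA32_2 { RGBA32 c1, c2; };

// Assumes sizeof(unsigned int) == 4.
static_assert(sizeof(RGBA8_misaligned) == 2 && alignof(RGBA8_misaligned) == 1,
              "two bytes, no alignment requirement beyond one byte");
static_assert(sizeof(RGBA32) == 16 && alignof(RGBA32) == 16,
              "one 16-byte element, 16-byte aligned");
static_assert(sizeof(RGBA32_2) == 32 && alignof(RGBA32_2) == 16,
              "two 16-byte halves, 16-byte aligned");

// Number of elements of type T that exactly fill a buffer of `bytes` bytes.
template <class T>
std::size_t elementsIn(std::size_t bytes) {
    assert(bytes % sizeof(T) == 0);
    return bytes / sizeof(T);
}
```

For example, elementsIn<RGBA32>(64) returns 4, while the same 64 bytes hold 32 RGBA8_misaligned elements or 2 RGBA32_2 elements.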

The goal is to copy data from one array to another with the kernel below, where TData is one of the struct types above.

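template <class TData>
__global__ void testKernel(TData *d_odata, TData *d_idata, int numElements) {
    // Global index of this thread and total number of threads in the grid.
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int numThreads = blockDim.x * gridDim.x;

    // Grid-stride loop: each thread copies every numThreads-th element.
    for (int pos = tid; pos < numElements; pos += numThreads) {
        d_odata[pos] = d_idata[pos];
    }
}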
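The indexing in this kernel can be checked without a GPU. The sketch below is a host-side emulation, not part of the sample: visitCounts is a hypothetical helper that replays the same loop for every simulated thread and counts how often each element index is touched. If the indexing is right, every count is exactly 1, so the loop itself neither skips nor duplicates elements.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of the kernel's indexing (no GPU required).
// gridDim and blockDim play the role of gridDim.x and blockDim.x.
std::vector<int> visitCounts(int gridDim, int blockDim, int numElements) {
    assert(gridDim > 0 && blockDim > 0 && numElements >= 0);
    std::vector<int> counts(numElements, 0);
    const int numThreads = blockDim * gridDim;
    for (int block = 0; block < gridDim; ++block) {
        for (int thread = 0; thread < blockDim; ++thread) {
            // Same arithmetic as the kernel, with the loops standing in
            // for blockIdx.x and threadIdx.x.
            const int tid = blockDim * block + thread;
            for (int pos = tid; pos < numElements; pos += numThreads) {
                ++counts[pos];
            }
        }
    }
    return counts;
}
```

Calling visitCounts(4, 8, 100) simulates 32 threads over 100 elements. The same holds when there are more threads than elements, for example visitCounts(2, 4, 3).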

In the experiments, the same amount of memory is allocated on the device for every type. My GPU is a V100 16GB. The code is the alignedTypes sample from the latest commit of the NVIDIA/cuda-samples repository on GitHub (GitHub - NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit). The results are as follows:

RGBA8_misaligned…

Number of elements in array 24999936

Avg. time: 0.730656 ms / Copy throughput: 63.731760 GB/s.

TEST FAILURE

RGBA32…

Number of elements in array 3124992

Avg. time: 0.153781 ms / Copy throughput: 302.806808 GB/s.

TEST OK

RGBA32_2…

Number of elements in array 1562496

Avg. time: 0.203812 ms / Copy throughput: 228.474753 GB/s.

TEST OK
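As a sanity check on these numbers: in all three runs, the element count times the element size (2, 16, and 32 bytes) gives the same 49,999,872 bytes, and the reported throughput matches that byte count divided by the average time, with 1 GB taken as 2^30 bytes. This formula is inferred from the output above, not copied from the sample's source, and copyThroughput is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Copy throughput in units of 2^30 bytes per second, computed from the
// element count, the element size in bytes, and the average time in ms.
// The buffer size is counted once (not once for the read and once for the write).
double copyThroughput(std::size_t numElements, std::size_t elementSize,
                      double avgTimeMs) {
    assert(avgTimeMs > 0.0);
    const double bytes = static_cast<double>(numElements) *
                         static_cast<double>(elementSize);
    const double seconds = avgTimeMs * 1e-3;
    return bytes / (seconds * 1073741824.0);
}
```

With the figures above, copyThroughput(3124992, 16, 0.153781) comes out at roughly 302.8 and copyThroughput(1562496, 32, 0.203812) at roughly 228.5, so the two runs move the same number of bytes and differ only in elapsed time.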

I have two questions.

  1. When the structure holds only two unsigned chars, as in RGBA8_misaligned above, the copy does not pass verification (TEST FAILURE). Why does it fail?
  2. The throughput of RGBA32_2 (about 228 GB/s) is much lower than that of RGBA32 (about 303 GB/s). What is the reason?
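One factor that may be relevant to the second question is the per-thread access width. CUDA global memory instructions move at most 16 bytes per thread, so a 16-byte RGBA32 element fits in one access while a 32-byte RGBA32_2 element presumably needs two. The sketch below only computes the address range touched by one warp-wide access under that assumption. warpSpanBytes is a hypothetical helper, and this is not a confirmed explanation of the measured gap, since caches may hide part of the effect.

```cpp
#include <cassert>
#include <cstddef>

// Bytes spanned in memory by one warp-wide load instruction when each of
// `warpSize` threads loads `loadBytes` bytes and consecutive threads'
// elements start `elementSize` bytes apart.
std::size_t warpSpanBytes(std::size_t elementSize, std::size_t loadBytes,
                          std::size_t warpSize = 32) {
    assert(warpSize > 0 && loadBytes <= elementSize);
    // Last thread's start offset plus the bytes that thread loads.
    return (warpSize - 1) * elementSize + loadBytes;
}
```

For RGBA32, warpSpanBytes(16, 16) is 512: the 32 threads of a warp read 512 contiguous bytes. For RGBA32_2, warpSpanBytes(32, 16) is 1008: each 16-byte access still delivers 512 useful bytes per warp but is spread over almost twice the address range, with the other halves picked up by the second access.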