Global memory access throughput for 16-byte vs. 32-byte structures

Following the alignedTypes code sample released by NVIDIA, I compared the global memory access throughput of the following two data structures, which are 16 bytes and 32 bytes in size respectively.

typedef struct __align__(16) {
    unsigned int r, g, b, a;
} RGBA32;

typedef struct __align__(16) {
    RGBA32 c1, c2;
} RGBA32_2;

I allocate the same amount of memory for both data structures, which means the RGBA32 array has twice as many elements as the RGBA32_2 array. The throughput results are as follows.


RGBA32:
Number of elements in array 3124992
Avg. time: 0.153469 ms / Copy throughput: 303.423411 GB/s.

RGBA32_2:
Number of elements in array 1562496
Avg. time: 0.205531 ms / Copy throughput: 226.564128 GB/s.
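As a sanity check on the setup, here is a minimal host-side sketch that confirms both arrays cover the same number of bytes and that the reported throughputs follow from the measured times. It makes a few assumptions that are not in the sample itself: `alignas(16)` stands in for `__align__(16)` so that it builds with a plain C++11 compiler, "GB" is taken to mean 2^30 bytes (the reading that matches the printed figures), and the helper names `throughput_gib_per_s` and `check_reported_numbers` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the CUDA types in the post; alignas(16) plays the role of
// __align__(16) so this compiles with a plain C++11 host compiler.
struct alignas(16) RGBA32 {
    unsigned int r, g, b, a;
};

struct alignas(16) RGBA32_2 {
    RGBA32 c1, c2;
};

static_assert(sizeof(RGBA32) == 16, "RGBA32 should occupy 16 bytes");
static_assert(sizeof(RGBA32_2) == 32, "RGBA32_2 should occupy 32 bytes");

// Element counts as reported in the post.
constexpr std::size_t kNumRGBA32 = 3124992;
constexpr std::size_t kNumRGBA32_2 = 1562496;

// Both arrays should cover the same number of bytes.
constexpr std::size_t kBytesRGBA32 = kNumRGBA32 * sizeof(RGBA32);
constexpr std::size_t kBytesRGBA32_2 = kNumRGBA32_2 * sizeof(RGBA32_2);
static_assert(kBytesRGBA32 == kBytesRGBA32_2, "buffers should be equally large");

// Assumption: "GB/s" in the program output means 2^30 bytes per second.
inline double throughput_gib_per_s(std::size_t bytes, double avg_time_ms) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0)
           / (avg_time_ms * 1e-3);
}

// Hypothetical helper: checks that the measured times roughly reproduce
// the throughput figures printed by the benchmark.
inline void check_reported_numbers() {
    const double t16 = throughput_gib_per_s(kBytesRGBA32, 0.153469);
    const double t32 = throughput_gib_per_s(kBytesRGBA32_2, 0.205531);
    assert(t16 > 303.41 && t16 < 303.43);
    assert(t32 > 226.55 && t32 < 226.57);
}
```

Calling `check_reported_numbers()` from `main` should pass without output. Since the byte counts are identical, the gap in throughput comes entirely from the longer average time of the RGBA32_2 copy.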

My question: since both arrays cover the same number of bytes, the total number of memory transactions should be the same for both data structures. Moreover, I would expect the hardware to exploit memory locality, so the throughput for RGBA32_2 should be at least as good, if not better. However, the experimental results show the opposite: the RGBA32_2 copy is slower. Why?