Why is using vectorized loads more efficient?

I read this article: https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access and also did some tests with simple memcpy kernels accessing global memory either via float or via float4 reads/writes. Vectorized access gives the best performance. But I don’t understand why.

Here are the two kernels that I am comparing. CopyFloat4 is launched with 4 times fewer threads than CopyFloat.

__global__ void CopyFloat(float* __restrict__ dest, const float* __restrict__ src)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    dest[index] = src[index];
}

__global__ void CopyFloat4(float4* __restrict__ dest, const float4* __restrict__ src)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    dest[index] = src[index];
}
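For completeness, here is roughly how I launch the two kernels (a sketch, not my exact benchmark harness; N is chosen to match the timings below and is assumed to be a multiple of 4 * 256, so no tail handling is needed):

const int N = 504 * 256;   // number of floats to copy
const int BLOCK = 256;

float *d_src, *d_dest;
cudaMalloc(&d_src,  N * sizeof(float));
cudaMalloc(&d_dest, N * sizeof(float));

// Scalar version: one float per thread.
CopyFloat<<<N / BLOCK, BLOCK>>>(d_dest, d_src);

// Vector version: one float4 (4 floats) per thread, i.e. 4 times fewer threads/blocks.
CopyFloat4<<<(N / 4) / BLOCK, BLOCK>>>(reinterpret_cast<float4*>(d_dest),
                                       reinterpret_cast<const float4*>(d_src));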

Since the scheduling is done at warp granularity, we need to look at how data accesses happen within a warp.

CopyFloat: each thread in a warp requests 4 bytes => coalesced into one 128-byte memory transaction per warp
=> total memory transactions: Thread_count / 32 = Element_count / 32
CopyFloat4: each thread in a warp requests 4 x 4 bytes = 16 bytes => four 128-byte memory transactions per warp
=> total memory transactions: (Thread_count / 32) * 4 = ((Element_count / 4) / 32) * 4 = Element_count / 32

So, for CopyFloat4, each warp issues 4 times as many memory transactions. But we have 4 times fewer warps, hence, in the end, both kernels generate the same number of memory transactions.
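To make that concrete with the launch configuration I used for the timings further down (256 threads per block):

CopyFloat:  504 blocks x 256 threads = 129,024 threads = 4,032 warps x 1 transaction  = 4,032 transactions
CopyFloat4: 126 blocks x 256 threads =  32,256 threads = 1,008 warps x 4 transactions = 4,032 transactions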

What am I missing, why is the vectorized kernel faster?

But we have 4 times fewer warps

Why?

Usually, when looking at performance, we want to do controlled experiments in which just a single variable changes. So, taking a given piece of code with a given run-time configuration and exchanging narrow loads for wide loads (a single-variable change!), the wider loads will be more efficient in terms of hardware usage (fewer instructions fetched and executed, fewer memory transactions queued), and this is likely beneficial to performance. From my understanding of the way modern GPUs work (basically grouping narrower accesses together), any observable performance gain may be minimal though.
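To make the "single variable change" concrete, here is a sketch of the kind of comparison I have in mind (kernel names are mine; both kernels are launched with the same grid/block configuration and each thread handles 4 floats, the only difference being four narrow accesses vs. one wide access):

__global__ void Copy4xScalar(float* __restrict__ dest, const float* __restrict__ src)
{
    // Each thread copies 4 consecutive floats using four 4-byte accesses.
    int index = 4 * (blockDim.x * blockIdx.x + threadIdx.x);
    dest[index + 0] = src[index + 0];
    dest[index + 1] = src[index + 1];
    dest[index + 2] = src[index + 2];
    dest[index + 3] = src[index + 3];
}

__global__ void Copy4xVector(float* __restrict__ dest, const float* __restrict__ src)
{
    // Same bytes per thread, same launch configuration, but a single 16-byte access.
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    reinterpret_cast<float4*>(dest)[index] = reinterpret_cast<const float4*>(src)[index];
}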

But we have 4 times fewer warps

Why?

We copy a fixed number of bytes with either CopyFloat or CopyFloat4. With CopyFloat4, each thread does 4 times as much work, hence we need 4 times fewer threads, and therefore 4 times fewer warps.

True, with vectorized memory accesses we have fewer instructions fetched and executed. However, instruction fetching/scheduling/execution is considerably less costly than actually doing reads/writes from global memory.

the wider loads will be more efficient in terms of hardware usage

Yes, that is probably the explanation, but where is this clearly documented? All I can find in the documentation is that global memory transactions are 128 bytes wide.

any observable performance gain may be minimal though

That’s the problem: the performance difference (on a GV100) is not small:
CopyFloat - 2.4 us, block = 256 x 1 x 1, grid = 504 x 1 x 1
CopyFloat4 - 1.9 us, block = 256 x 1 x 1, grid = 126 x 1 x 1

As long as the GPU is filled with close to the number of warps that can run simultaneously, it doesn’t matter how many total threads there are in the launched grid.

I don’t have the specs for the GV100 in front of me, but it is a very big chip, and your selected problem size may be too small to approach full copy speed. Try copying a couple of GB of memory using a grid containing on the order of 100,000 thread blocks.
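Something along these lines (just a sketch; the exact sizes don't matter much, as long as the amount of data and the grid are large):

const size_t bytes = 2ull << 30;               // 2 GiB
const size_t n     = bytes / sizeof(float);    // 536,870,912 floats
const int    block = 256;

float *d_src, *d_dest;
cudaMalloc(&d_src,  bytes);
cudaMalloc(&d_dest, bytes);

CopyFloat<<<n / block, block>>>(d_dest, d_src);          // 2,097,152 blocks
CopyFloat4<<<(n / 4) / block, block>>>(
    reinterpret_cast<float4*>(d_dest),
    reinterpret_cast<const float4*>(d_src));             // 524,288 blocks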

Note that I said the performance difference between copying float vs float4 may be small on modern GPUs. In your example the difference is about 25%.

My two main rules for bulk memory copies: (1) Don’t! (2) If you can easily achieve it without major code contortions, use vector types for wide memory accesses (note the increased alignment restrictions!); otherwise, use whatever the natural size of the data is.
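Regarding the alignment restriction: a float4 access requires a 16-byte aligned address, which buffers straight from cudaMalloc satisfy, but an arbitrary offset into a float array may not. A typical pattern handles the bulk of the data with float4 and the remaining 0-3 elements with scalar accesses; a rough sketch (the kernel name is mine, and the launch must provide at least ceil(n/4) threads):

__global__ void CopyVec4WithTail(float* __restrict__ dest,
                                 const float* __restrict__ src, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Bulk of the data: one 16-byte float4 access per thread
    // (assumes dest and src are 16-byte aligned, e.g. straight from cudaMalloc).
    int n4 = n / 4;
    if (tid < n4) {
        reinterpret_cast<float4*>(dest)[tid] =
            reinterpret_cast<const float4*>(src)[tid];
    }

    // Remaining 0-3 elements: plain scalar copies done by the first few threads.
    int tail = n - 4 * n4;
    if (tid < tail) {
        dest[4 * n4 + tid] = src[4 * n4 + tid];
    }
}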

I used the small memcpy kernels just to analyze how memory access happens and figure out how I should or should not write my kernels.

Even for the above-mentioned grid/block configuration, there should be enough warps to keep the GPU busy. My GV100 has 80 SMs and the kernel is launched with 504/126 blocks (CopyFloat/CopyFloat4).

In the end, what I’m trying to find out is whether memory access still happens via 128-byte memory transactions, like in the old days. Sure, with vectorized loads/stores we have fewer instructions, but if global memory is still accessed via 128-byte memory transactions, then the impact on performance should be small.
Is there any documentation explaining the architectural details of this, i.e. what happens between the kernel code and the actual memory accesses?