Why are vectorized loads more efficient?

I read this article: https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access and also did some tests with simple memcpy kernels accessing global memory either via float or via float4 reads/writes. Vectorized access gives the best performance. But I don’t understand why.

Here are the two kernels that I am comparing. CopyFloat4 is launched with a quarter as many threads as CopyFloat.

__global__ void CopyFloat(float* __restrict__ dest, const float* __restrict__ src)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    dest[index] = src[index];
}

__global__ void CopyFloat4(float4* __restrict__ dest, const float4* __restrict__ src)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    dest[index] = src[index];
}

Since the scheduling is done at warp granularity, we need to look at how data accesses happen within a warp.

CopyFloat: each thread within a warp requests 4 bytes => coalesced into 1 x 128-byte memory transaction per warp
=> Total memory transactions: Thread_count / 32 = Element_count / 32
CopyFloat4: each thread within a warp requests 4 x 4 bytes => 4 x 128-byte memory transactions per warp
=> Total memory transactions: (Thread_count / 32) * 4 = ((Element_count / 4) / 32) * 4 = Element_count / 32

So, for CopyFloat4, a warp issues 4 times as many memory transactions, but we have a quarter as many warps. Hence, in the end, both kernels issue the same total number of memory transactions.

What am I missing, why is the vectorized kernel faster?

But we have 4 times less warps

Why?

Usually, when looking at performance, we want to do controlled experiments in which just a single variable changes. So using a given piece of code, with a given run-time configuration, and exchanging narrow loads for wide loads (single variable change!), the wider loads will be more efficient in terms of hardware usage (fewer instructions fetched and executed, fewer memory transactions queued), and this is likely beneficial to performance. From my understanding of the way modern GPUs work (basically grouping narrower accesses together), any observable performance gain may be minimal, though.

But we have 4 times less warps

Why?

We have a fixed amount of bytes that we copy with either CopyFloat or CopyFloat4. So, when using CopyFloat4, each thread does 4 times as much work, hence we have a quarter as many threads and a quarter as many warps.

True, with vectorized memory accesses we have fewer instructions fetched and executed. However, instruction fetching/scheduling/execution is considerably less costly than actually doing reads/writes from global memory.

the wider loads will be more efficient in terms of hardware usage

Yes, that is probably the explanation, but where is this clearly documented? All I can find in the documentation is that global memory transactions are 128 bytes wide.

any observable performance gain may be minimal though

That’s the problem: the performance differences (on a GV100) are not small:
CopyFloat - 2.4 us, block = 256 x 1 x 1, grid = 504 x 1 x 1
CopyFloat4 - 1.9 us, block = 256 x 1 x 1, grid = 126 x 1 x 1

As long as the GPU is filled with close to the number of warps that can run simultaneously, it doesn’t matter how many total threads there are in the launched grid.

I don’t have the specs for the GV100 in front of me, but it is a very big chip, and your selected problem size may be too small to approach full copy speed. Try copying a couple of GB of memory using a grid containing on the order of 100,000 thread blocks.

Note that I said the performance difference between copying float vs float4 may be small on modern GPUs. In your example the difference is about 25%.

My two main rules of bulk memory copies: (1) Don’t! (2) If you can easily achieve it without major code contortions, use vector types for wide memory access (note increase in alignment restrictions!); otherwise, use whatever the natural size of the data is.

I used the small memcpy kernels just to analyze how memory access happens and figure out how I should or should not write my kernels.

Even for the above-mentioned grid/block configuration, there should be enough warps to keep the GPU busy. My GV100 has 80 SMs and the kernel is launched with 504/126 blocks.

In the end, what I’m trying to find out is whether memory access still happens via 128-byte memory transactions, as in the old days. Sure, when using vectorized loads/stores we have fewer instructions, but if global memory is still accessed via 128-byte memory transactions, then the impact on performance should be small.
Is there any documentation explaining the architecture details of this, i.e. what happens from the kernel code down to the actual memory accesses?

For CopyFloat4, why are 4 transactions generated per warp? In your example, CopyFloat assumes that all threads within a warp access data lying together, so there is one transaction per warp. Why can't the threads do the same in the CopyFloat4 case?

If I am on the right track, the total number of transactions per warp in CopyFloat4 would be 1 as well. In that case, since the total number of warps becomes a quarter of what it was before, the total number of transactions would also drop to a quarter.

“Fewer transaction requests reduce memory controller contention and can produce higher overall memory bandwidth utilization.” ref. StackOverflow

The GPU memory controller will not issue a transaction that is 512 bytes wide. It breaks that into four 128 byte transactions. This isn’t a characteristic of the code, or of CUDA, but of the GPU memory controller.

All other things being equal, fewer transactions per request generally indicates higher efficiency. But there is no loss of efficiency when a single, coalesced 512 byte request is broken into multiple coalesced transactions. The memory bus utilization efficiency is the same, and the impact to the number of transactions in flight is unavoidable; it is the nature of current GPU hardware, and in this specific comparison, is no worse than the non-vectorized load case.

Fewer transactions per request generally indicates higher efficiency? Are you sure? Why?

I tried to verify my previous statements and found that under nvprof (where the “transactions per request” has an associated metric), the behavior of the metric varies in a confusing way between Fermi, Kepler, and Maxwell. This is very old stuff so I’m unwilling to try and spell it out here.

For a more modern treatment of the way I think about how to measure this kind of efficiency, see here.

I’ve edited my previous post a bit to remove the confusing statement(s).