How to understand the alignment of a 2D array and fully coalesced memory access

Suppose we have a 2D array with 31 rows and 31 columns, stored contiguously, and we want to access the elements of the array.

We have just one block with 31*31 threads.

The accesses look like this: *(BaseAddress + blockDim.x * threadIdx.y + threadIdx.x) = 1.0;
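For concreteness, here is a minimal sketch of the kernel I have in mind (the kernel name and launch configuration are just placeholders):

    // Minimal sketch: one block of 31x31 threads, each thread writes one element
    __global__ void writeOnes(float* BaseAddress)
    {
        *(BaseAddress + blockDim.x * threadIdx.y + threadIdx.x) = 1.0f;
    }

    // launched as, e.g.:
    // writeOnes<<<1, dim3(31, 31)>>>(devPtr);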

There are floor(31*31/32) + 1 = 31 warps in the block.

For the first warp, the threads should be t(0,0),…,t(30,0),t(0,1), and for the second warp, the threads are t(1,1),…,t(30,1),t(0,2),t(1,2).

I don’t fully understand why the accesses are not fully coalesced.

Can someone explain it a little bit more?

You’re making a claim, as if it is fact, and then asking others to explain it.
Before you do that, you should be sure that others agree with you as to your “facts”.

Why do you say the accesses are not fully coalesced?

  • is that what you think?
  • is it written somewhere?
  • did you use a tool (e.g. profiler) that seems to be telling you this?

Let’s assume the elements of your array are int or float.

If your threadblock accesses the elements as you say, then there will be 30 warps that access the elements in a fully coalesced way. The 31st warp would only have a single active thread in it. So we would have to be more specific about our access claims for this 31st warp. A full global load transaction would be issued for this warp. However only 4 bytes (one element) would actually be used. This probably doesn’t fit the definition of “fully coalesced”, because when many people use the words “fully coalesced” they mean “full efficiency”. But there is no “uncoalesced” character, even to the access generated by this last warp.

But for the first 30 warps, their access would be fully coalesced, and fully efficient.

Efficiency is just (bytes used / bytes requested) for a given warp instruction/transaction.
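For example, in your case each of the first 30 warps uses all 128 bytes it requests (32 threads * 4 bytes), while the 31st warp uses only the 4 bytes of its single active thread out of whatever transaction is issued for it, so its efficiency is low.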

“Coalesced” means that the accesses required by each thread in a warp instruction/transaction can be “coalesced”, or grouped together, into an efficient set of transactions. This usually means that the accesses are:

  1. adjacent and contiguous
  2. aligned to an appropriate boundary, e.g. a 128-byte boundary for global load requests

Note that both 1 and 2 are satisfied even for your 31st warp.

(in the above discussion, I am assuming the underlying array allocation is not a pitched allocation)

The profiler metric that can be used to explore this is gld_efficiency (for global loads). The transaction you have actually identified would be a global store transaction, so the efficiency metric for that is gst_efficiency. But note that the profiler often will return confusing results for very small transaction quantities, e.g. one threadblock. It’s easier to understand the profiler output if you launch ~50 threadblocks, which all generate the same patterns.
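For example (assuming you have nvprof available), something like:

    nvprof --metrics gld_efficiency,gst_efficiency ./my_app

where my_app is a placeholder for your executable.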

Thank you very much.

Sorry for the unclear question.

Actually, I am reading the CUDA manual. In section 5.3.2 there is a paragraph: “A common global memory access pattern is when each thread of index (tx,ty) uses the following address to access one element of a 2D array of width width, located at address BaseAddress of type type* (where type meets the requirement described in Maximize Utilization):
BaseAddress + width * ty + tx
For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.”

Actually, I don’t understand the manual.

In my opinion, if the elements are accessed as row[i], each row should be aligned to 128 or 256 bytes. On the other hand, if the elements are accessed as BaseAddress + offset, it is not necessary to align each row; only the first row needs to be aligned.

For example, on page 23 of the CUDA C manual, there is a code demonstration:

    // Device code
    __global__ void MyKernel(float* devPtr,
                             size_t pitch, int width, int height)
    {
        for (int r = 0; r < height; ++r) {
            // each row starts at devPtr + r * pitch bytes
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];
            }
        }
    }
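For context, I believe the manual pairs this kernel with host code roughly like the following (the sizes and launch configuration below are placeholders, not part of what I quoted):

    // Host code (sketch): allocate a pitched 2D array and launch the kernel
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    // pitch is the padded row size in bytes (>= width * sizeof(float)),
    // chosen so that every row starts at a properly aligned address
    MyKernel<<<100, 512>>>(devPtr, pitch, width, height);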

But I don’t understand the code. As shown above, each thread accesses all the elements one by one, and at any moment all the threads need the same element, so is it necessary to pad each row?

I can only understand that if 32 threads want to access 32 different elements which are placed contiguously and start at an aligned address, they can be handled together efficiently.

Can you help me with that?

So, hopefully someone will correct me if I’m wrong about this stuff but…

In CUDA, a warp is a set of threads that execute the same instruction in lockstep, and that number happens to be 32.

Global loads are done in 128-byte segments.

Coalescence means all 32 threads together are pulling in all 128 bytes. If you access a global array of any 4-byte type with the thread ID as the array index, you’ll get a coalesced read.

But what really matters is global load efficiency. Global load efficiency is the ratio of the amount of memory pulled in by a warp vs the theoretical maximum, 128 bytes.

What I mean by this is, if you have a float3 type, that’s a 12-byte type. This means that at most 10 threads can fit their reads in a single 128-byte segment (12 * 10 = 120 => load efficiency = 120 / 128). It also means that reading one float3 per thread across a warp gets split into 3 separate memory transactions!

But sometimes, you gotta pull in a lot of data and a non-perfect load efficiency is okay.

That being said, all threads reading from the same location is fine, because the value can be broadcast to the whole warp from a single transaction. No need to pad. Just make sure that, for a warp, all the data is in a contiguous 128-byte block.

Edit:

And stick to 1d storage and then use 2d indexing to read from it.
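Something like this rough sketch (the kernel and variable names are made up):

    // 1D storage with 2D indexing: consecutive threadIdx.x values touch
    // consecutive addresses, so each warp reads/writes a contiguous range
    __global__ void fill2D(float* data, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            data[row * width + col] = 1.0f;
    }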

The manual assumes the general case where you have a 2D array of some arbitrary size (i.e. width), and you have multiple 2D threadblocks in a 2D grid.

In that case, the accesses would not properly coalesce. For example, the first warp would choose 31 elements from the first line of the 2D array and 1 element from the second. These would typically not be adjacent.

Your specific example, of one block accessing a single array of size 31x31, is a special case where width = blockDim.x. In that case, you would have coalescing as I described. But in the general case covered by the manual, you would not.
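To make the general case concrete, here is a sketch of the pattern the manual describes (a hypothetical kernel, not taken from the manual):

    // Each thread computes a global (tx, ty) and accesses
    // BaseAddress + width * ty + tx, as in the manual's formula
    __global__ void touch2D(float* BaseAddress, int width, int height)
    {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        int ty = blockIdx.y * blockDim.y + threadIdx.y;
        if (tx < width && ty < height)
            BaseAddress[width * ty + tx] = 1.0f;
    }

If blockDim.x is a multiple of 32, each warp stays within a single row; if width is also a multiple of 32 (with a 4-byte element type), each row starts on a 128-byte boundary, so every warp's accesses are contiguous and aligned. With width = 31, the rows do not start on aligned boundaries, so the accesses from multiple blocks cannot all be fully coalesced.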

You might think your original question was just a restatement of what is in the manual. It is not.

The code block you are now quoting from the manual covers a pitched array. Your original question did not (at least my first answer assumed it did not, you gave no indication of pitching).

  1. The L2 cache has 32-byte lines, the L1 cache has 128-byte lines. Modern GPUs (at least Maxwell+) usually don’t cache global data in L1, so today you may assume that the minimum load chunk is 32 bytes

  2. A warp contains 32 threads. Each thread requests data from some address, and for each address the entire chunk (i.e. 32 bytes) containing it is read

  3. Fully coalesced just means that you read only 4 such chunks for the entire warp, i.e. you use the memory throughput in the optimal way
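As a worked example: 32 threads each loading a 4-byte float from consecutive addresses need 32 * 4 = 128 bytes, which is exactly 4 of those 32-byte chunks when the starting address is aligned to 32 bytes.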

Grateful for your reply.

Following your comment, (tx, ty) should be understood as:
(blockIdx.x*blockDim.x+threadIdx.x, blockIdx.y*blockDim.y+threadIdx.y), is that correct?

In addition, I want to know the instruction throughput of the manual code:

    // Device code
    __global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
    {
        for (int r = 0; r < height; ++r) {
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];
            }
        }
    }

All 32 threads of a warp access the same element (4 bytes) in global memory. If the memory transaction for one thread is 32 bytes, no additional memory transaction is needed. Each thread of the warp can obtain the desired memory (the same element), so the warp can satisfy its 32*4 bytes of requests from just one 32-byte memory transaction.

In my opinion, the instruction throughput should be multiplied by 4; is that correct?

Thanks a lot.

It’s a very clear reply.

I understand your comment as: each thread should search the L2 first. If the thread can find the needed data there, a refill of the L2, which would lead to another memory transaction, is not needed. Is that correct?