Memory coalescing in one thread

Hi,

if I have a kernel writing multiple values per thread to global memory in sequence, like

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

global[location    ] = value1;

global[location + 1] = value2;

global[location + 2] = value3;

global[location + 3] = value4;

global[location + 4] = value5;

global[location + 5] = value6;

global[location + 6] = value7;

global[location + 7] = value8;

where global memory variable and values are a float3. So each thread writes 8x float3 one after another, and all threads write to consecutive addresses.

My question: Can these memory access inside and between threads be coalesced on a Fermi 2.0 (GeForce GTX 480) device?

In principle, the writes are all consecutively. Each thread writes 8 * 3 * 4byte = 96byte, which could be done in a 64byte and a 32byte memory transaction (for each thread). Or thread 0 could be combined with the first 32byte of the next thread to one complete 128byte transaction, then the remaining 64bytes of thread 1 with the first 64bytes of thread 2 and so on.

My nightmare would be, that each value is written sequentially.

The CUDA Programming Guide isn’t really clear about this.

Thanks for your answers in advance.

basic unit of memory access is per warp, not per thread.

You need to use shared memory to avoid non-coalesced access

Suppose there are 256 threads per threadblock, then

__shared__ float sh[256*3*8] ;

...

const int tid = threadIdx.x;

#pragma unroll

for(int j = 0 ; j < 8 ; j++){

   sh[ tid * 24 + j   ] = value[j].x ;

   sh[ tid * 24 + j+1 ] = value[j].y ;

   sh[ tid * 24 + j+2 ] = value[j].z ;

}

__syncthreads();

int location = blockIdx.x * blockDim.x + threadIdx.x;

for( int j=0 ; j < 8*3 ; j++){

   global[location + j*256] = sh[tid + j*256]; 

}

page 177:

Other accesses larger than 32-bit are split into 32-bit, 64-bit, or 128-bit accesses. The following code, for example:

struct type{ 

float x, y, z; }; 

__shared__ struct type shared[32]; 

struct type data = shared[BaseIndex + tid];

results in three separate 32-bit reads without bank conflicts since each member is accessed with a stride of three 32-bit words.

In your code, threads with with id (4n+m) that share the same m will have bank conflict… and maybe you’ll take 8x as much time. LSChien’s code doesn’t seem to solve the problem for you. Must one set of value(1-8) be computed in a single thread?

Currently your stride between threads is 24-word. As a principle, if the stride is even-word, you will have bank conflicts. A simple solution to your problem, with a bit waste of memory, is padding.

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 9;

global[location ] = value1;

global[location + 1] = value2;

global[location + 2] = value3;

global[location + 3] = value4;

global[location + 4] = value5;

global[location + 5] = value6;

global[location + 6] = value7;

global[location + 7] = value8;

So in this way the stride becomes 3*9 = 27 which is odd, and which means there is no bank conflict

Or you could break your 3-word sized structure into its elements, so that you could save 2 words of space per group of value(1-8)

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 25; //uses 1 extra word only, instead of 3

Other solutions exist, but they may not be as simple and practical as padding.

As the friend before me already explained, bank conflicts is always a matter between threads. When you have 32-bit access, all threads in a warp access together, taking 2 core clocks to finish issuing of instructions. When you have 64-bit access, the warp is divided into two half warps and the instruction is executed over 2 scheduler clocks (4 core clock). In each scheduler clock only instructions for half of the warp are issued. In 128-bit access, the warp is divided into 4 groups.

When you have 64-bit or 128-bit access, you only have to worry about whether within a group there will be bank conflict. And in those cases, the difference between access address of consecutive threads should be odd multiples of 64 bit or 128 bit, otherwise there still will be bank conflicts.

Please forgive me for being so rude, but I really could not stop laughing upon reading your code. :rofl:

A few things I could point out:

  1. shared memory access is subjected to bank conflicts as well.

  2. it appears that more bank conflicts are created in your code than in the original code

First of all thank you both for your replies.

hyqneuron: I think you misunderstood me. I do not want to optimize for shared memory bank conflicts (yet). My concern is global memory access and coalescing multiple global memory accesses into 32-, 64- and 128-byte transactions. As each thread writes a lot of data (96bytes), I think coalsecing is crucial to avoid global memory latency for each access and also to get the maximum bandwidth.

So, if I write for example just a single float3 to global memory, then this access is split by the compiler into three successive float write operations (in ptx code). Are these combined to one memory transaction? Or are they processed independently, as the device executes each write operation in sequence?

That would be bad, because it would mean 24 single float (4byte) write operations for each thread, which can hardly be coalesced, as the stride between threads is 96bytes.

LSChien: I already use shared memory (left it for simplicity), but how does that concern global memory coalescing? In ptx code, the values are read from shared memory into registers first and then written to global memory.

Edit: LSChien, I needed a few looks, but now I see what you are doing. That’s a really interesting idea! The accesses are all coalsced. I will try it tomorrow, thx again.

omg… I am in no way trying to help you deal with shared memory, though the same technique would work equally well for shared memory.

Yes they are executed one after another. You’ll get bank conflicts all the time. Even 64-bit and 128-bit transactions are done on a 32-bit basis, if you understood the division mechanism of a warp into multiple groups.

I apologize for the second point. In LSChein’s code the amount of bank conflict is just as many as your original code. Don’t be surprised when you realize that code takes slightly longer to run than your original code (100 cycles longer? I guess).

oh no, not 100. A lot more, because there is a sync there.

Hey people, sorry for posting incessantly here. I’m excited whenever there’s a chance to learn.

I am very confused by you, vinzenz. It appears, on one hand, that you are dealing with bank conflicts (because you say they can be broken down into 32… 64)… But if you are certain that you are saying 32bytes and 64 bytes, you’re… talking about cache line, perhaps? Anyways, on CC2.0 (GTX480), there is nothing as 32 or 64 byte cache transactions. All are 128 bytes.

In that case, yes, you could reorder your access using shared memory and then write them all to global memory in coalesced manner. However, you may still use padding for your shared memory writes. You would speed up that section by 8 times.

And I have questions too: Are writes performed in 128-byte manner regardless of whether it’s cached or not? Wouldn’t right? So what happens when I write to 32 non-continuous, coalesced addresses that are specified to be stream access? Are they still done at one go in global memory?

Have you been misled by the CUDA best practice guide 3.2? Despite it having the version number 3.2, it is using GTX 8800 and 280 to explain the cache transactions. OMG! That’s why it’s all talking about 64-byte and 32-byte memory transactions. In CC2.0, to which your GTX 480 belong, according to page 164 of CUDA programming guide 3.2, all L1 cache transactions are done in 128 bytes (32 bytes for L2, though).

You are right, Fermi uses cache lines meaning all global memory accesses are 128byte transactions (if cached in L1 and L2). Anyway, with LSChien’s method these accesses can be coalesced, so I was able to speed up my kernel by a factor of nearly 2.

Glad to see that… Maybe I’ll do some tests to check the uncached behaviour

The basic unit of L1/L2 cache is 128 bytes (or say a cahce line), if you disable cache by compiler option, then
transaction size is 32 byte. Please check section F.4.2 in programming guide (4.0RC).

Also you may be interested in talks of Paulius Micikevicius in GTC2010
2011 Fundamental Performance Optimizations for GPUs
http://www.nvidia.com/object/gtc2010-presentation-archive.html

The key is to keep coalesced property but may increase some penalty on shared memory,
including bank-conflict. When bank-conflict occurs, then instruction is replayed.
However if your app is strong memory-bound, then you need to save your bus by any means.
If your app is not memory-bound, then you may lose performance.
For example, if you do matrix multiplication on data type float3,
(you can use operator overloading to re-define multiplication on float3), then
you don’t need to care about coalesced.

In practices, just try all possible solutions and then you find a good one.

Isn’t section F about texture fetch? I do not have 4.0RC and I didn’t find the programming guide in google either. Can you please paste that section here?

Isn’t section F about texture fetch? I do not have 4.0RC and I didn’t find the programming guide in google either. Can you please paste that section here?

F.4.2 in programming guide (RC4.0)

Thanks for the info!

btw, finally got my Registered Developer application approved… haha having 4.0 RC now!!

Also, I feel a bit bad because I could get a maximum of 48K shared memory per MP even though L1 cache can be disabled altogether… Is it that even though L1 and L2 can be disabled, all memory accesses still have to go through them because there is no direct route to access the global memory?