Memory coalescing in one thread

vinzenz · March 27, 2011, 11:51pm

Hi,

if I have a kernel writing multiple values per thread to global memory in sequence, like

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

global[location    ] = value1;

global[location + 1] = value2;

global[location + 2] = value3;

global[location + 3] = value4;

global[location + 4] = value5;

global[location + 5] = value6;

global[location + 6] = value7;

global[location + 7] = value8;

where global memory variable and values are a float3. So each thread writes 8x float3 one after another, and all threads write to consecutive addresses.

My question: Can these memory access inside and between threads be coalesced on a Fermi 2.0 (GeForce GTX 480) device?

In principle, the writes are all consecutively. Each thread writes 8 * 3 * 4byte = 96byte, which could be done in a 64byte and a 32byte memory transaction (for each thread). Or thread 0 could be combined with the first 32byte of the next thread to one complete 128byte transaction, then the remaining 64bytes of thread 1 with the first 64bytes of thread 2 and so on.

My nightmare would be, that each value is written sequentially.

The CUDA Programming Guide isn’t really clear about this.

Thanks for your answers in advance.

LSChien · March 28, 2011, 9:03am

basic unit of memory access is per warp, not per thread.

You need to use shared memory to avoid non-coalesced access

Suppose there are 256 threads per threadblock, then

__shared__ float sh[256*3*8] ;

...

const int tid = threadIdx.x;

#pragma unroll

for(int j = 0 ; j < 8 ; j++){

   sh[ tid * 24 + j   ] = value[j].x ;

   sh[ tid * 24 + j+1 ] = value[j].y ;

   sh[ tid * 24 + j+2 ] = value[j].z ;

}

__syncthreads();

int location = blockIdx.x * blockDim.x + threadIdx.x;

for( int j=0 ; j < 8*3 ; j++){

   global[location + j*256] = sh[tid + j*256]; 

}

hyqneuron · March 28, 2011, 11:18am

page 177:

Other accesses larger than 32-bit are split into 32-bit, 64-bit, or 128-bit accesses. The following code, for example:

struct type{ 

float x, y, z; }; 

__shared__ struct type shared[32]; 

struct type data = shared[BaseIndex + tid];

results in three separate 32-bit reads without bank conflicts since each member is accessed with a stride of three 32-bit words.

In your code, threads with with id (4n+m) that share the same m will have bank conflict… and maybe you’ll take 8x as much time. LSChien’s code doesn’t seem to solve the problem for you. Must one set of value(1-8) be computed in a single thread?

hyqneuron · March 28, 2011, 1:41pm

Currently your stride between threads is 24-word. As a principle, if the stride is even-word, you will have bank conflicts. A simple solution to your problem, with a bit waste of memory, is padding.

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 9;

global[location ] = value1;

global[location + 1] = value2;

global[location + 2] = value3;

global[location + 3] = value4;

global[location + 4] = value5;

global[location + 5] = value6;

global[location + 6] = value7;

global[location + 7] = value8;

So in this way the stride becomes 3*9 = 27 which is odd, and which means there is no bank conflict

Or you could break your 3-word sized structure into its elements, so that you could save 2 words of space per group of value(1-8)

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 25; //uses 1 extra word only, instead of 3

Other solutions exist, but they may not be as simple and practical as padding.

hyqneuron · March 28, 2011, 2:05pm

As the friend before me already explained, bank conflicts is always a matter between threads. When you have 32-bit access, all threads in a warp access together, taking 2 core clocks to finish issuing of instructions. When you have 64-bit access, the warp is divided into two half warps and the instruction is executed over 2 scheduler clocks (4 core clock). In each scheduler clock only instructions for half of the warp are issued. In 128-bit access, the warp is divided into 4 groups.

When you have 64-bit or 128-bit access, you only have to worry about whether within a group there will be bank conflict. And in those cases, the difference between access address of consecutive threads should be odd multiples of 64 bit or 128 bit, otherwise there still will be bank conflicts.

hyqneuron · March 28, 2011, 2:18pm

basic unit of memory access is per warp, not per thread.

You need to use shared memory to avoid non-coalesced access

Suppose there are 256 threads per threadblock, then

__shared__ float sh[256*3*8] ;

...

const int tid = threadIdx.x;

#pragma unroll

for(int j = 0 ; j < 8 ; j++){

   sh[ tid * 24 + j   ] = value[j].x ;

   sh[ tid * 24 + j+1 ] = value[j].y ;

   sh[ tid * 24 + j+2 ] = value[j].z ;

}

__syncthreads();

int location = blockIdx.x * blockDim.x + threadIdx.x;

for( int j=0 ; j < 8*3 ; j++){

   global[location + j*256] = sh[tid + j*256]; 

}

Please forgive me for being so rude, but I really could not stop laughing upon reading your code. External Image

A few things I could point out:

shared memory access is subjected to bank conflicts as well.
it appears that more bank conflicts are created in your code than in the original code

vinzenz · March 28, 2011, 9:50pm

First of all thank you both for your replies.

hyqneuron: I think you misunderstood me. I do not want to optimize for shared memory bank conflicts (yet). My concern is global memory access and coalescing multiple global memory accesses into 32-, 64- and 128-byte transactions. As each thread writes a lot of data (96bytes), I think coalsecing is crucial to avoid global memory latency for each access and also to get the maximum bandwidth.

So, if I write for example just a single float3 to global memory, then this access is split by the compiler into three successive float write operations (in ptx code). Are these combined to one memory transaction? Or are they processed independently, as the device executes each write operation in sequence?

That would be bad, because it would mean 24 single float (4byte) write operations for each thread, which can hardly be coalesced, as the stride between threads is 96bytes.

LSChien: I already use shared memory (left it for simplicity), but how does that concern global memory coalescing? In ptx code, the values are read from shared memory into registers first and then written to global memory.

Edit: LSChien, I needed a few looks, but now I see what you are doing. That’s a really interesting idea! The accesses are all coalsced. I will try it tomorrow, thx again.

hyqneuron · March 29, 2011, 12:46am

omg… I am in no way trying to help you deal with shared memory, though the same technique would work equally well for shared memory.

Yes they are executed one after another. You’ll get bank conflicts all the time. Even 64-bit and 128-bit transactions are done on a 32-bit basis, if you understood the division mechanism of a warp into multiple groups.

I apologize for the second point. In LSChein’s code the amount of bank conflict is just as many as your original code. Don’t be surprised when you realize that code takes slightly longer to run than your original code (100 cycles longer? I guess).

hyqneuron · March 29, 2011, 12:47am

oh no, not 100. A lot more, because there is a sync there.

hyqneuron · March 29, 2011, 2:59am

Hey people, sorry for posting incessantly here. I’m excited whenever there’s a chance to learn.

I am very confused by you, vinzenz. It appears, on one hand, that you are dealing with bank conflicts (because you say they can be broken down into 32… 64)… But if you are certain that you are saying 32bytes and 64 bytes, you’re… talking about cache line, perhaps? Anyways, on CC2.0 (GTX480), there is nothing as 32 or 64 byte cache transactions. All are 128 bytes.

In that case, yes, you could reorder your access using shared memory and then write them all to global memory in coalesced manner. However, you may still use padding for your shared memory writes. You would speed up that section by 8 times.

And I have questions too: Are writes performed in 128-byte manner regardless of whether it’s cached or not? Wouldn’t right? So what happens when I write to 32 non-continuous, coalesced addresses that are specified to be stream access? Are they still done at one go in global memory?

hyqneuron · March 29, 2011, 5:29am

Have you been misled by the CUDA best practice guide 3.2? Despite it having the version number 3.2, it is using GTX 8800 and 280 to explain the cache transactions. OMG! That’s why it’s all talking about 64-byte and 32-byte memory transactions. In CC2.0, to which your GTX 480 belong, according to page 164 of CUDA programming guide 3.2, all L1 cache transactions are done in 128 bytes (32 bytes for L2, though).

vinzenz · March 29, 2011, 9:54pm

You are right, Fermi uses cache lines meaning all global memory accesses are 128byte transactions (if cached in L1 and L2). Anyway, with LSChien’s method these accesses can be coalesced, so I was able to speed up my kernel by a factor of nearly 2.

hyqneuron · March 30, 2011, 12:30am

Glad to see that… Maybe I’ll do some tests to check the uncached behaviour

LSChien · March 30, 2011, 2:24am

The basic unit of L1/L2 cache is 128 bytes (or say a cahce line), if you disable cache by compiler option, then
transaction size is 32 byte. Please check section F.4.2 in programming guide (4.0RC).

Also you may be interested in talks of Paulius Micikevicius in GTC2010
2011 Fundamental Performance Optimizations for GPUs
http://www.nvidia.com/object/gtc2010-presentation-archive.html

The key is to keep coalesced property but may increase some penalty on shared memory,
including bank-conflict. When bank-conflict occurs, then instruction is replayed.
However if your app is strong memory-bound, then you need to save your bus by any means.
If your app is not memory-bound, then you may lose performance.
For example, if you do matrix multiplication on data type float3,
(you can use operator overloading to re-define multiplication on float3), then
you don’t need to care about coalesced.

In practices, just try all possible solutions and then you find a good one.

hyqneuron · March 30, 2011, 3:55am

Isn’t section F about texture fetch? I do not have 4.0RC and I didn’t find the programming guide in google either. Can you please paste that section here?

hyqneuron · March 30, 2011, 3:55am

Isn’t section F about texture fetch? I do not have 4.0RC and I didn’t find the programming guide in google either. Can you please paste that section here?

LSChien · March 30, 2011, 4:32am

F.4.2 in programming guide (RC4.0)

hyqneuron · March 31, 2011, 9:49am

Thanks for the info!

btw, finally got my Registered Developer application approved… haha having 4.0 RC now!!

Also, I feel a bit bad because I could get a maximum of 48K shared memory per MP even though L1 cache can be disabled altogether… Is it that even though L1 and L2 can be disabled, all memory accesses still have to go through them because there is no direct route to access the global memory?

Topic		Replies	Views
Using Shared Memory in CUDA C/C++ Technical Blog	36	1954	October 8, 2020
char global memory access optimization CUDA Programming and Performance	17	11865	May 31, 2010
Coalesced Memory access related doubt CUDA Programming and Performance	13	1972	December 9, 2010
Local memory performance Using more than 4kb kills it.. why? CUDA Programming and Performance	24	5074	September 6, 2008
Question on the L1 caching of the GK 110 CUDA Programming and Performance	17	7143	April 17, 2013
Shared memory question CUDA Programming and Performance	27	7316	June 23, 2008
Effective global memory bandwidth? CUDA Programming and Performance	17	17562	September 18, 2007
Memory coalescing and multiple arrays CUDA Programming and Performance	23	11654	March 20, 2009
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3307	January 10, 2010
Even more Fermi Fun: Uncoalesced writes CUDA Programming and Performance	8	8886	June 5, 2010

Memory coalescing in one thread

Related topics