How warp serialization works on shared memory / How to run "data[n] += something" efficiently

Hello

I’m looking at the output of my profiler, and for my kernel I get rather high warp serialization. I have identified the responsible code as the part operating on my shared memory buffer. As far as I understand, this happens if two threads in a block require the same data. To keep things simple, I have written a small piece of code that illustrates my point.

Now, I run this on a 9600 GT card under Windows XP 32-bit, with the CUDA 3.0 toolkit. I compile for sm_11 with speed optimization.

The arithmetic is simply there to give the kernel something to do and doesn’t matter. What is critical is the “+=” part of the code. Ignore any 16/32-bit integer misuse. See below…

[codebox]

__global__ void testAddIterative( __int16 * dpOutput )

{

__shared__ int resultBuffer[2048];

int nThread = threadIdx.x + blockDim.x * threadIdx.y;	

int nBlock = blockIdx.x + blockIdx.y * gridDim.y;

int nTotalThreads = blockDim.x * blockDim.y;

	

int nOffset = 8 * nThread;	// each thread works on its own 8 consecutive words

int nSub;

for( int nCount = 0; nCount < 500; nCount++ ){

	for( nSub = 0; nSub < 8; nSub++ ){

		resultBuffer[nOffset + nSub] += 2 * rintf( __fsqrt_rn( 5.0f + (float)nCount )); //Critical part

	}

}

int nTotal = 0;

for( nSub = 0; nSub < 8; nSub++ ){

	nTotal += resultBuffer[nOffset+nSub];		

}	

dpOutput[nThread] = nTotal;

}

[/codebox]

When I run my profiler, I get the following profiler counters for the kernel:

Instructions = 66%

Warp Serialize = 31%

The code runs using a (32,8) block with a (4,64) grid. The output buffer is just a dummy.

I understand that the memory does not allow several threads to read from the same position at once. But the above code does not do that: each set of 8 indexes is only touched by a single thread. It reads once, adds, writes. The compiler can even schedule the arithmetic in between to hide latency if that is needed at all. What am I missing here?

Help is greatly appreciated.

Cheers

Henrik Andresen

That isn’t the most common source of warp serialization related to shared memory. Shared memory is striped across 16 32-bit wide banks, each bank holding 256 words (i.e. 1 KB). Whenever more than one thread in a half-warp needs to read from the same bank, you get a bank conflict, and that causes serialization. This can happen because of the size of the type or the pitch at which data is stored in shared memory.
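
To make this concrete, here is a minimal sketch (a made-up kernel, not your code), assuming a 1D block of 256 threads on a compute 1.x device. The bank of a 4-byte word is simply its word index modulo 16, so a stride-8 pattern like yours only ever touches banks 0 and 8:

[codebox]
__global__ void bankDemo( int * dpOut )
{
	__shared__ int s[256];

	int t = threadIdx.x;

	s[t] = t;			// stride 1: the 16 threads of a half-warp hit 16 different banks
	__syncthreads();

	int a = s[t];			// stride-1 read: conflict-free
	int b = s[(8 * t) % 256];	// stride-8 read: bank (8*t) % 16 is always 0 or 8,
					// so each half-warp causes an 8-way bank conflict
	dpOut[t] = a + b;
}
[/codebox]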

Ahh, that makes my results much clearer! Thank you.

So, if I have 256 threads reading and writing over an 8 KB area (evenly spread), can I have them do something like coalesced reads to make it quicker, or is it simply the way it is? I don’t know if coalescing applies to shared memory at all.

Edit: If I spread the threads of a half-warp over as much of the shared memory as possible, would that reduce serialization?

Thanks

There is no such thing as coalescing in shared memory. However, both shared and global memory are optimized for the same preferred access pattern: Threads accessing adjacent words in memory.

So you don’t want to spread the threads out in shared memory, but (usually) keep them close together.

Just pad each thread’s stride from 8 words to 9 to avoid the bank conflicts. Since 9 and 16 have no common factor, the 16 threads of a half-warp then always access 16 different banks:

[codebox]

__global__ void testAddIterative( __int16 * dpOutput )

{

__shared__ int resultBuffer[2304];

int nThread = threadIdx.x + blockDim.x * threadIdx.y;	

int nBlock = blockIdx.x + blockIdx.y * gridDim.y;

int nTotalThreads = blockDim.x * blockDim.y;

	

int nOffset = 9 * nThread;	// stride of 9 instead of 8: the threads of a half-warp land in different banks

for(int nSub = 0; nSub < 9; nSub++ ) {

	resultBuffer[nOffset + nSub] = 0;	// initialise the shared buffer before accumulating

}

for( int nCount = 0; nCount < 500; nCount++ ) {

	for(int nSub = 0; nSub < 9; nSub++ ) {

		resultBuffer[nOffset + nSub] += 2 * rintf( __fsqrt_rn( 5.0f + (float)nCount )); //Critical part

	}

}

int nTotal = 0;

for( int nSub = 0; nSub < 9; nSub++ ) {

	nTotal += resultBuffer[nOffset+nSub];		

}	

dpOutput[nThread] = nTotal;

}

[/codebox]

I have a similar problem with my convolution kernels: I store some data in shared memory and the filter in constant memory. Each thread calculates the convolution result for one pixel or voxel. I do “normal” convolution, not separable convolution.

I use a for-loop like this, but I do the unrolling myself to get better performance.

I get a lot of warp serialization since the threads read pixels that are close to each other (i.e. the same banks in shared memory). How can I rewrite the for-loop to avoid this?

I store floats in shared memory: __shared__ float s_Image[64][63]

When I look at the CUDA profiler output I seem to have NO warp serialization for my 2D convolution, but a lot for my 3D convolution. Can anyone explain that?

For 3D I use shared memory as

__shared__ float s_Volume[16][16][15]

Hey Tera, thank you for your reply.

I don’t completely understand what you wrote, though. If I access adjacent words in memory with my code, similar to when I read coalesced from global memory, won’t I get serialization due to bank conflicts? And if I spread my threads over different banks, how big is the memory-access penalty compared to the latency of serialization?

Assuming I have a [32,8] block of threads, the first 32 threads (one column) will form two half-warps. Would I get the best performance by having threads 1-16 read addresses p[0], p[1], …, p[15], or should they read p[0], p[256], p[512], …, p[4096]? (Assuming Fermi for enough shared memory, and 4-byte values.) The latter would let each thread access a new bank, but spread my reads out. The other half-warp would then still read from p[1], p[257], p[513], …, p[4097]; would that be OK and not give me a bank penalty?

Is this covered in any documents from NVIDIA? If so, I have missed it.

Cheers

You seem to have the same misconception about bank layout: words that are close to each other are not in the same bank! Banks are interleaved, so adjacent words in shared memory fall into different banks.

You can probably remove the bank conflicts by reversing the thread indices. Address your shared memory through s_Image[threadIdx.z + z_offset][threadIdx.y + y_offset][threadIdx.x + x_offset] or s_Image[threadIdx.y + y_offset][threadIdx.x + x_offset][threadIdx.z + z_offset].
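
For illustration, here is a minimal sketch of the two index orders (a made-up kernel, assuming the 8x8x7 block and the [16][16][15] array mentioned in this thread):

[codebox]
__global__ void indexOrderDemo( float * dpOut )
{
	__shared__ float s_Volume[16][16][15];

	// linear thread id for an 8x8x7 block
	int t = threadIdx.x + 8 * ( threadIdx.y + 8 * threadIdx.z );

	// crude fill so the reads below have defined values
	for( int i = t; i < 16 * 16 * 15; i += 8 * 8 * 7 )
		( (float *)s_Volume )[i] = (float)i;
	__syncthreads();

	// threadIdx.x in the last (fastest-varying) dimension: neighbouring
	// threads read neighbouring 4-byte words, which sit in different
	// banks, so bank conflicts are largely avoided
	float good = s_Volume[threadIdx.z][threadIdx.y][threadIdx.x];

	// threadIdx.x in the first dimension: neighbouring threads are
	// 16 * 15 = 240 words apart, and 240 % 16 == 0, so the 8 threads of
	// a half-warp that share the same threadIdx.y all hit one bank
	float bad = s_Volume[threadIdx.x][threadIdx.y][threadIdx.z];

	dpOut[t] = good + bad;
}
[/codebox]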

No, the banks are interleaved: Bank 0 contains p[0], p[16], p[32], p[48], … Bank 1 holds p[1], p[17], … on compute capability 1.x devices (Fermi has 32 banks).

So you want to access adjacent words precisely because that gives you no bank conflicts. Serialization is a direct consequence of bank conflicts; it is not a different concept.

You definitely want to access p[0],p[1],…,p[15] in your half-warp so that each thread reads from a different bank (no bank conflict). Reading from p[0], p[256], p[512], …, p[4096] would produce the worst case, a 16-way bank conflict, as all reads go to the same bank.

It is covered both in the Programming Guide and in the Best Practices Guide.
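
To make the bank arithmetic concrete, here is a tiny host-side sketch (plain C, made-up names), assuming 4-byte words and the 16 banks of compute capability 1.x:

[codebox]
#include <stdio.h>

/* Bank of a 4-byte shared memory word on compute 1.x: word index % 16 */
int main( void )
{
	for( int t = 0; t < 16; t++ )	/* one half-warp */
	{
		printf( "thread %2d: p[t] -> bank %2d, p[256*t] -> bank %2d\n",
			t, t % 16, ( 256 * t ) % 16 );
	}
	return 0;	/* p[t] covers banks 0..15; p[256*t] always lands in bank 0 */
}
[/codebox]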

Ahh!!! (small light bulb on top of my head)

Thank you very much! It helped a lot.

Cheers!

The first way is how I do the convolution… (I guess you mean s_Volume and not s_Image). I will try to flip x and y, but does it really make any difference if I use a thread configuration of

dimBlock = dim3(8,8,7) ?

Does it matter if I store the data as

__shared__ float s_Volume[16][15][16]

instead of

__shared__ float s_Volume[16][16][15]

?