Dear all,
I have a question about shared memory access patterns. The two kernels below solve the same problem, and both access global memory in a coalesced manner. The fundamental difference is the data layout in shared memory: kernel 1 was designed to minimize shared memory bank conflicts and is, in fact, completely conflict-free, whereas kernel 2 uses a much simpler shared memory access pattern and consequently generates up to 16-way bank conflicts.
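As a side note on the two layouts (assuming 32-bit shared memory elements and the 16 banks of a compute 1.x device such as the 9800 GT), a word at index i lives in bank i % 16, so the padded stride of 65 used by kernel 1 spreads a half-warp over all 16 banks, while the unpadded stride of 64 used by kernel 2's final gather puts every thread in the same bank:
[codebox]// Side note, not part of the kernels: on a compute 1.x device shared
// memory has 16 banks of 32-bit words, so a 32-bit word at index i
// lives in bank i % 16 (this sketch assumes the long int elements are
// 4 bytes wide).
#include <stdio.h>

int main(void){
    const int NUM_BANKS = 16;
    // Bank hit by thread tid of a half-warp for the padded (65) and the
    // unpadded (64) stride, as in the final gather of the two kernels.
    for(int tid = 0; tid < 16; ++tid)
        printf("tid %2d: stride 65 -> bank %2d, stride 64 -> bank %2d\n",
               tid, (65 * tid) % NUM_BANKS, (64 * tid) % NUM_BANKS);
    // Stride 65 covers all 16 banks (conflict-free); stride 64 maps
    // every thread of the half-warp to bank 0 (a 16-way conflict).
    return 0;
}[/codebox]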
Source code kernel 1:
[codebox]__global__ void dictSearchGPU(long int *g_Norm, long int *g_X, long int *g_D){
    //Set offsets
    unsigned int normThreadId = threadIdx.x & 15;
    unsigned int threadGroup = (threadIdx.x - normThreadId) >> 4;
    unsigned int offset = 65 * threadGroup;
    //Determine component positions within vectors
    unsigned int sharedVectorPos = offset + normThreadId;
    unsigned int globalVectorPos = 1024 * blockIdx.x + 64 * threadGroup + normThreadId;
    //(X_i - S_i_j)^2 calculation
    __shared__ long int s_D[1039];
    s_D[sharedVectorPos] = g_X[normThreadId] - g_D[globalVectorPos];
    s_D[sharedVectorPos] = __mul24(s_D[sharedVectorPos], s_D[sharedVectorPos]);
    s_D[sharedVectorPos + 16] = g_X[normThreadId + 16] - g_D[globalVectorPos + 16];
    s_D[sharedVectorPos + 16] = __mul24(s_D[sharedVectorPos + 16], s_D[sharedVectorPos + 16]);
    s_D[sharedVectorPos + 32] = g_X[normThreadId + 32] - g_D[globalVectorPos + 32];
    s_D[sharedVectorPos + 32] = __mul24(s_D[sharedVectorPos + 32], s_D[sharedVectorPos + 32]);
    s_D[sharedVectorPos + 48] = g_X[normThreadId + 48] - g_D[globalVectorPos + 48];
    s_D[sharedVectorPos + 48] = __mul24(s_D[sharedVectorPos + 48], s_D[sharedVectorPos + 48]);
    //Reduction
    __syncthreads();
    s_D[sharedVectorPos] += s_D[sharedVectorPos + 16];
    s_D[sharedVectorPos + 32] += s_D[sharedVectorPos + 48];
    s_D[sharedVectorPos] += s_D[sharedVectorPos + 32];
    if(normThreadId < 8)
        s_D[sharedVectorPos] += s_D[sharedVectorPos + 8];
    if(normThreadId < 4)
        s_D[sharedVectorPos] += s_D[sharedVectorPos + 4];
    if(normThreadId < 2)
        s_D[sharedVectorPos] += s_D[sharedVectorPos + 2];
    if(normThreadId == 0)
        s_D[offset] += s_D[offset + 1];
    //Transfer sum results from shared to global memory
    __syncthreads();
    if(threadIdx.x < 16)
        g_Norm[16 * blockIdx.x + threadIdx.x] = s_D[65 * threadIdx.x];
}[/codebox]
Source code kernel 2:
[codebox]__global__ void dictSearchGPU(long int *g_Norm, long int *g_X, long int *g_D){
    //Determine component positions within vectors
    unsigned int XIndex = threadIdx.x & 63;
    unsigned int globalVectorPos = 1024 * blockIdx.x + threadIdx.x;
    //(X_i - S_i_j)^2 calculation
    __shared__ long int s_D[1024];
    s_D[threadIdx.x] = g_X[XIndex] - g_D[globalVectorPos];
    s_D[threadIdx.x] = __mul24(s_D[threadIdx.x], s_D[threadIdx.x]);
    s_D[threadIdx.x + 256] = g_X[XIndex] - g_D[globalVectorPos + 256];
    s_D[threadIdx.x + 256] = __mul24(s_D[threadIdx.x + 256], s_D[threadIdx.x + 256]);
    s_D[threadIdx.x + 512] = g_X[XIndex] - g_D[globalVectorPos + 512];
    s_D[threadIdx.x + 512] = __mul24(s_D[threadIdx.x + 512], s_D[threadIdx.x + 512]);
    s_D[threadIdx.x + 768] = g_X[XIndex] - g_D[globalVectorPos + 768];
    s_D[threadIdx.x + 768] = __mul24(s_D[threadIdx.x + 768], s_D[threadIdx.x + 768]);
    //Reduction
    __syncthreads();
    s_D[2 * threadIdx.x] += s_D[2 * threadIdx.x + 1];
    s_D[2 * threadIdx.x + 512] += s_D[2 * threadIdx.x + 513];
    __syncthreads();
    s_D[4 * threadIdx.x] += s_D[4 * threadIdx.x + 2];
    __syncthreads();
    if(threadIdx.x < 128)
        s_D[8 * threadIdx.x] += s_D[8 * threadIdx.x + 4];
    __syncthreads();
    if(threadIdx.x < 64)
        s_D[16 * threadIdx.x] += s_D[16 * threadIdx.x + 8];
    __syncthreads();
    if(threadIdx.x < 32)
        s_D[32 * threadIdx.x] += s_D[32 * threadIdx.x + 16];
    __syncthreads();
    if(threadIdx.x < 16)
        s_D[64 * threadIdx.x] += s_D[64 * threadIdx.x + 32];
    //Transfer sum results from shared to global memory
    __syncthreads();
    if(threadIdx.x < 16)
        g_Norm[16 * blockIdx.x + threadIdx.x] = s_D[64 * threadIdx.x];
}[/codebox]
The question is: why does kernel 2 execute significantly faster (0.558476 msec on a GeForce 9800 GT) than kernel 1 (0.880197 msec), even though it is much less favorable in terms of shared memory bank conflicts? The ptxas output shows that in both cases the application can schedule 3 thread blocks (= 768 threads) on each SM.
Kernel 1 ptxas info: Used 10 registers, 4184+28 bytes smem, 20 bytes cmem[1].
Kernel 2 ptxas info: Used 8 registers, 4124+28 bytes smem, 24 bytes cmem[1].
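Just to double-check the 3-blocks-per-SM figure from those numbers, here is a rough back-of-envelope check against the usual per-SM limits of a compute 1.1 part (8192 registers, 16 KB shared memory, 768 resident threads); register and shared memory allocation granularity is ignored, so this is only a sanity check:
[codebox]// Back-of-envelope occupancy check for a compute 1.1 part such as the
// 9800 GT (per-SM limits: 8192 registers, 16384 bytes shared memory,
// 768 resident threads). Allocation granularity is ignored.
#include <stdio.h>

static void check(const char *name, int regsPerThread, int smemPerBlock){
    const int threadsPerBlock = 256;
    const int blocks = 3;
    int regs    = blocks * regsPerThread * threadsPerBlock; // e.g. 3 * 10 * 256 = 7680 for kernel 1
    int smem    = blocks * smemPerBlock;                    // e.g. 3 * 4212 = 12636 for kernel 1
    int threads = blocks * threadsPerBlock;                 // 3 * 256 = 768
    printf("%s: %d regs (<= 8192: %s), %d bytes smem (<= 16384: %s), %d threads (<= 768: %s)\n",
           name,
           regs,    regs    <= 8192  ? "yes" : "no",
           smem,    smem    <= 16384 ? "yes" : "no",
           threads, threads <= 768   ? "yes" : "no");
}

int main(void){
    check("kernel 1", 10, 4184 + 28);
    check("kernel 2",  8, 4124 + 28);
    return 0;
}[/codebox]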
Additional information on the source code:
Input parameters:
long int *g_X: vector of 64 elements (values between 0 and 255)
long int *g_D: vector of 32768 * 64 = 2097152 elements (values between 0 and 255)
Output parameters:
long int *g_Norm: vector of 32768 elements
Execution configuration:
2048 thread blocks of 256 threads each
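For completeness, here is the host-side setup in simplified form (initialization of the input vectors and error checking are omitted; the timing with CUDA events is only shown schematically):
[codebox]// Simplified host-side setup matching the sizes and launch
// configuration listed above. The kernel definition is one of the two
// dictSearchGPU versions posted earlier (compiled in the same file).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dictSearchGPU(long int *g_Norm, long int *g_X, long int *g_D);

int main(void){
    long int *g_X, *g_D, *g_Norm;
    cudaMalloc((void**)&g_X,    64      * sizeof(long int));
    cudaMalloc((void**)&g_D,    2097152 * sizeof(long int)); // 32768 * 64
    cudaMalloc((void**)&g_Norm, 32768   * sizeof(long int));
    // ... copy the input vectors g_X and g_D to the device here ...

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dictSearchGPU<<<2048, 256>>>(g_Norm, g_X, g_D); // 2048 blocks of 256 threads
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f msec\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(g_X); cudaFree(g_D); cudaFree(g_Norm);
    return 0;
}[/codebox]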