Avoiding shared memory bank conflicts

Hi

I have implemented an algorithm on my Fermi card, but when I profile it I seem to get quite a lot of shared memory bank conflicts. The problem is that I have a hard time understanding how to avoid them. Since I cannot find any concrete examples on the subject, I want to try my luck here. The programming guide only mentions using an odd stride s for 32-bit shared memory accesses to avoid bank conflicts. GPU Gems 3, Chapter 39 (Parallel Prefix Sum (Scan) with CUDA) has this bank-conflict-avoidance macro, which is somewhat unclear to me:

#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
#define CONFLICT_FREE_OFFSET(n) \
    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))

I cannot really understand it.
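
For what it is worth, here is how I currently read the programming guide's "odd stride" advice, as a minimal sketch of my own (not taken from the guide): pad a shared 2D tile by one word so that column-wise accesses use an odd stride and land in different banks.

// My own sketch of the padding idea, not from the programming guide.
// Launch with a 32x32 thread block and 32x32 float matrices.
#define TILE_DIM 32

__global__ void padded_tile_example(const float *in, float *out)
{
    // TILE_DIM + 1 makes the row stride 33 words (odd), so the column-wise
    // read below hits 32 different banks instead of one bank 32 times.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE_DIM + threadIdx.x];
    __syncthreads();

    // Within a warp threadIdx.x varies, so the shared address stride is 33 words.
    out[threadIdx.x * TILE_DIM + threadIdx.y] = tile[threadIdx.x][threadIdx.y];
}

Whether that is the right reading, and how it maps onto the uint4 loads in my kernel below, is exactly what I am unsure about.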

In my code (a radix sort implementation) I have this piece of code that I think will give bank conflicts:

__global__ void Sort_blocks(uint* d_OKeys, uint* d_OValues,
                            uint* d_keys, uint* d_values,
                            const unsigned int startBit,
                            const unsigned int shiftAmount)
{
    volatile __shared__ uint s_keys[RADIX_CTA_SZ*RADIX_ELTS_T];   // 1024. Perform scattering in shared memory
    volatile __shared__ uint s_values[RADIX_CTA_SZ*RADIX_ELTS_T]; // 1024. Values
    volatile __shared__ uint s_data[RADIX_CTA_SZ];                // 256. Used for prefix sum

    const uint devOffset  = blockIdx.x*RADIX_CTA_SZ;
    const uint devOffset2 = blockIdx.x*(RADIX_CTA_SZ << 2);
    const uint sIdx       = threadIdx.x << 2; // multiply by 4 because we load 4 elements at a time

    uint4 *temp = (uint4*)d_keys;
    uint4 *tVal = (uint4*)d_values;
    uint4 key   = temp[threadIdx.x + devOffset];
    uint4 val   = tVal[threadIdx.x + devOffset];

    s_keys[sIdx]     = key.x;
    s_keys[sIdx+1]   = key.y;
    s_keys[sIdx+2]   = key.z;
    s_keys[sIdx+3]   = key.w;
    s_values[sIdx]   = val.x;
    s_values[sIdx+1] = val.y;
    s_values[sIdx+2] = val.z;
    s_values[sIdx+3] = val.w;

    //...do some other things...

    d_OKeys[sIdx + devOffset2]       = s_keys[sIdx];
    d_OKeys[sIdx + devOffset2 + 1]   = s_keys[sIdx+1];
    d_OKeys[sIdx + devOffset2 + 2]   = s_keys[sIdx+2];
    d_OKeys[sIdx + devOffset2 + 3]   = s_keys[sIdx+3];
    d_OValues[sIdx + devOffset2]     = s_values[sIdx];
    d_OValues[sIdx + devOffset2 + 1] = s_values[sIdx+1];
    d_OValues[sIdx + devOffset2 + 2] = s_values[sIdx+2];
    d_OValues[sIdx + devOffset2 + 3] = s_values[sIdx+3];
}

There are some more shared memory accesses, but they all use the same access pattern. I cast my global unsigned int* pointers to uint4* and then each thread loads 4 elements at a time into shared memory. I guess the bank conflicts come from the fact that thread 5 (and subsequent threads; the rest of the warp too, since Fermi handles shared memory access per warp and not per half-warp?) in the same warp will access 4 elements belonging to the first 4 shared memory banks. Am I correct? So I will have bank conflicts. How do I avoid this? Is it even possible? I get around 1,000,000 shared memory bank conflicts when sorting 12 million elements, so I guess my performance is quite low.
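
If I understand the CUDPP trick I quote further down correctly, the loads could instead be spread RADIX_CTA_SZ apart rather than 4 consecutive words per thread. Here is a minimal, untested sketch of my own (only the key half, assuming RADIX_CTA_SZ = 256 and RADIX_ELTS_T = 4 as in the comments above; the rest of my kernel would of course have to index s_keys in the same transposed layout):

// Untested sketch of my own, not CUDPP code.
#define RADIX_CTA_SZ  256
#define RADIX_ELTS_T  4

__global__ void Sort_blocks_strided_load(uint* d_keys)
{
    __shared__ uint s_keys[RADIX_CTA_SZ * RADIX_ELTS_T];   // 1024 words

    uint4 key = ((uint4*)d_keys)[threadIdx.x + blockIdx.x * RADIX_CTA_SZ];

    // Each thread stores its four words RADIX_CTA_SZ apart. Since 256 is a
    // multiple of 32, every warp-wide store hits 32 distinct banks
    // (bank = threadIdx.x % 32), instead of four threads sharing a bank.
    s_keys[threadIdx.x                   ] = key.x;
    s_keys[threadIdx.x +     RADIX_CTA_SZ] = key.y;
    s_keys[threadIdx.x + 2 * RADIX_CTA_SZ] = key.z;
    s_keys[threadIdx.x + 3 * RADIX_CTA_SZ] = key.w;

    // ...the rest of the real kernel would have to follow this layout too,
    // so this is only an illustration of the store pattern.
}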

Thanks!


I have been looking into the CUDPP implementation of radix sort, and they have a part where they stride the shared memory reads and writes.

Code snippet from CUDPP:

// This arithmetic strides the ranks across 4 SORT_CTA_SIZE regions
sMem1[(r.x & 3) * SORT_CTA_SIZE + (r.x >> 2)] = key.x;
sMem1[(r.y & 3) * SORT_CTA_SIZE + (r.y >> 2)] = key.y;
sMem1[(r.z & 3) * SORT_CTA_SIZE + (r.z >> 2)] = key.z;
sMem1[(r.w & 3) * SORT_CTA_SIZE + (r.w >> 2)] = key.w;
__syncthreads();

// The above allows us to read without 4-way bank conflicts:
key.x = sMem1[threadIdx.x];
key.y = sMem1[threadIdx.x +     SORT_CTA_SIZE];
key.z = sMem1[threadIdx.x + 2 * SORT_CTA_SIZE];
key.w = sMem1[threadIdx.x + 3 * SORT_CTA_SIZE];

Here the uint4 key is the key to be sorted and the uint4 r is the rank of that key. I can understand that the read back into key does not give a bank conflict, because each thread accesses shared memory linearly (e.g. thread i accesses index i) and the keys are 32 bits, so one key sits in one bank. What I cannot understand is why the first part of the code is needed, where the keys are written to shared memory. The rank r depends entirely on the (hopefully unsorted) input data, so r cannot be predicted.

Why is the upper half of the code needed in order for the lower part to be bank-conflict free? The upper part can still have shared memory bank conflicts, am I correct?
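
To double-check my claim that the lower (read) part is conflict-free, I wrote a tiny host-side helper (my own, nothing from CUDPP) that just prints which of the 32 Fermi banks each thread of a warp would hit for the four reads, assuming SORT_CTA_SIZE is 256:

#include <stdio.h>

#define SORT_CTA_SIZE  256   // assumed value; any multiple of 32 behaves the same
#define NUM_BANKS      32    // Fermi

int main(void)
{
    for (int pass = 0; pass < 4; ++pass) {            // the key.x, key.y, key.z, key.w reads
        printf("read %d banks:", pass);
        for (int tid = 0; tid < 32; ++tid) {          // one warp
            int idx = tid + pass * SORT_CTA_SIZE;     // sMem1 index read by this thread
            printf(" %2d", idx % NUM_BANKS);          // 32-bit words map round-robin onto banks
        }
        printf("\n");                                 // each row lists banks 0..31 exactly once
    }
    return 0;
}

Every read indeed spreads over all 32 banks, which is why I do not see what the strided writes buy on top of that.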
