POSSIBLE BANK CONFLICTS? GPU faster on medium-sized data, slower on huge data!

Hi, I am implementing an FIR filter with a circular buffer on CUDA, and I get good results when the input signal (delay line) is relatively small, meaning the GPU is faster than the CPU. However, if I increase the input signal to around 1 million samples, the CPU is faster again, just as it is for very small data!! I’ve tried to learn something from cudaprof, and my guess is that I am writing to shared memory and there are some bank conflicts.

__global__ void fir(float* res){

	// One accumulator slot per thread in shared memory.
	__shared__ float th_r[THREAD_N];

	const int tId = threadIdx.x;

	for(int b_id = blockIdx.x; b_id < BLOCK_N; b_id += 1){

		// Starting output index for this thread within this block's partition.
		int slice_base = SLICE * tId;
		int count = THREAD_N * b_id + slice_base;

		// Each thread computes SLICE consecutive output samples.
		for (int j = 0; j < SLICE; j++, count++) {

			int count2 = count;
			th_r[tId] = 0;

			// Dot product of the taps with the circular delay line.
			for (int k = 0; k < TAPS; k++) {
				th_r[tId] += tex1Dfetch(texRef2, k) * tex1Dfetch(texRef1, count2--);
				if (count2 < 0) count2 = SIG_SIZE - 1;	// wrap around the circular buffer
			}

			res[count] = th_r[tId];
		}
	}
}

I used textures to avoid the high latency of global memory. Any help is appreciated!! Thanks!!

I’m working on a GTS 250, CUDA 3.0, Linux (Fedora).
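
(For completeness, the host side is the usual texture-reference setup, roughly as sketched below; the helper name run_fir and the host/device pointer names h_sig, h_taps, h_res, d_sig, d_taps, d_res are placeholders here, not the actual code.)

texture<float, 1, cudaReadModeElementType> texRef1;	// delay line (input signal)
texture<float, 1, cudaReadModeElementType> texRef2;	// filter taps

void run_fir(const float* h_sig, const float* h_taps, float* h_res){
	float *d_sig, *d_taps, *d_res;
	cudaMalloc((void**)&d_sig,  SIG_SIZE * sizeof(float));
	cudaMalloc((void**)&d_taps, TAPS * sizeof(float));
	cudaMalloc((void**)&d_res,  SIG_SIZE * sizeof(float));	// assuming SIG_SIZE output samples

	cudaMemcpy(d_sig,  h_sig,  SIG_SIZE * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_taps, h_taps, TAPS * sizeof(float), cudaMemcpyHostToDevice);

	// Bind the linear device buffers to the texture references the kernel reads.
	cudaBindTexture(0, texRef1, d_sig,  SIG_SIZE * sizeof(float));
	cudaBindTexture(0, texRef2, d_taps, TAPS * sizeof(float));

	// Launch one block per partition (what the posted outer loop seems to expect).
	fir<<<BLOCK_N, THREAD_N>>>(d_res);
	cudaMemcpy(h_res, d_res, SIG_SIZE * sizeof(float), cudaMemcpyDeviceToHost);

	cudaUnbindTexture(texRef1);
	cudaUnbindTexture(texRef2);
	cudaFree(d_sig); cudaFree(d_taps); cudaFree(d_res);
}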

  1. Since you are not sharing th_r with other threads, you could just as easily use an automatic variable (a register), it seems to me. That would be the end of any bank conflicts.

  2. FIR is a kind of convolution. There are convolution examples in the SDK which discuss the issues (like bank-conflicts). Although the examples deal with images, the filters are separable, meaning that they are implemented as two linear convolutions, which makes them useful to you, I think.

  3. There are tools for checking both race conditions and bank-conflict, see e.g. http://citeseerx.ist.psu.edu/viewdoc/downl…p1&type=pdf

  4. Bank conflicts arise when several threads of a half-warp access different addresses that fall in the same shared-memory bank; in your code each thread uses only its “own” location, in consecutive banks, so there should be no bank conflicts afaik.

  5. Your taps seem to be in texRef2; why not put them in constant memory?

  6. Your blocks have very different workloads (the b_id loop runs anywhere between BLOCK_N and roughly zero iterations depending on the block), which could explain the bad scaling (?)

So, my advice is to redesign the way your blocks divide the work and, less importantly, use an automatic variable for th_r.

Edit: you can also check the SDK example simpleCUFFT; it uses an FFT for 1D convolution.

Yeah, thanks!.. Points 1) and 5) are already done! About point 6, I am not sure I understood what you mean. The workload should be the same for every block: depending on the size of the input data (delay line), I set the macros BLOCK_N and SLICE so that each block takes care of a certain partition of the vector, and each thread takes care of a certain number of positions within the assigned block partition. Why do you think that is wrong?

I’ll have a look at the samples anyway!.. Thank you very much!

Here’s the updated code; however, I am not getting any better results, so I have to understand this block-workload issue you mentioned!

__device__ __constant__ float d_coeffs[TAPS];

__global__ void fir(float* res){

	const int tId = threadIdx.x;
	register float t_res;	// per-thread accumulator in a register instead of shared memory

	for(int b_id = blockIdx.x; b_id < BLOCK_N; b_id += 1){

		int slice_base = SLICE * tId;
		int count = THREAD_N * b_id + slice_base;

		// Each thread computes SLICE consecutive output samples.
		for (int j = 0; j < SLICE; j++, count++) {

			int count2 = count;
			t_res = 0;

			// Taps now come from constant memory instead of a texture.
			for (int k = 0; k < TAPS; k++) {
				t_res += d_coeffs[k] * tex1Dfetch(texRef1, count2--);
				if (count2 < 0) count2 = SIG_SIZE - 1;	// wrap around the circular buffer
			}

			res[count] = t_res;
		}
	}
}
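
(On the host, the taps get into d_coeffs with a plain cudaMemcpyToSymbol before the launch, roughly like this; h_coeffs is just a placeholder name for the host-side tap array.)

	float h_coeffs[TAPS];	// host copy of the filter taps
	/* ... fill h_coeffs ... */
	cudaMemcpyToSymbol(d_coeffs, h_coeffs, TAPS * sizeof(float));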

I am unsure of what exactly you are doing, as it has been a long time since I did anything with filters.
Nevertheless, just by looking at your code I can see that the outermost for loop is the source of the imbalance that jan mentioned.
Block 0 will iterate BLOCK_N times (with b_id going one by one from 0 to BLOCK_N-1),
while
block x will iterate only BLOCK_N-x times (with b_id going one by one from x to BLOCK_N-1).

The length of the inner for loops does not depend on the block index or on the b_id iteration count.
Therefore, the first block will do approximately BLOCK_N times more work than the last block.

If there is a way to balance your workload better, I believe you can cut your run time by half.
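
For example (just a sketch, with only the b_id increment changed from the kernel you posted), striding the outer loop by the grid size gives every block roughly the same number of partitions:

__global__ void fir(float* res){
	const int tId = threadIdx.x;
	register float t_res;

	// Grid-stride over the partitions: block 0 takes partitions 0, gridDim.x, 2*gridDim.x, ...
	// so every block ends up with about BLOCK_N / gridDim.x partitions.
	for(int b_id = blockIdx.x; b_id < BLOCK_N; b_id += gridDim.x){
		int count = THREAD_N * b_id + SLICE * tId;
		for (int j = 0; j < SLICE; j++, count++) {
			int count2 = count;
			t_res = 0;
			for (int k = 0; k < TAPS; k++) {
				t_res += d_coeffs[k] * tex1Dfetch(texRef1, count2--);
				if (count2 < 0) count2 = SIG_SIZE - 1;
			}
			res[count] = t_res;
		}
	}
}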

That’s it!.. you solved the problem, mate!! Thank you!! I had to increment b_id by gridDim.x instead of 1!.. Now it’s running well, even if cudaprof says there are quite a few uncoalesced accesses; I suppose they refer to the constant memory, i.e. the coefficients.
