float2 to float array Bank conflicts question

Hi,

I have a float2 array. I want to get the ‘x’ part of it to create a float array.

My problem is that I do not know how to avoid bank conflicts.

Here's what I do:

__global__ void complex2real_kernel(Complex* in, Real* out,
                                    int width, int height)
{
  __shared__ Complex block[BLOCK_DIM * BLOCK_DIM];

  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index  = yIndex * width + xIndex;

  // copy to shared mem
  block[threadIdx.x + threadIdx.y * BLOCK_DIM] = in[index];
  __syncthreads();

  // read back only the real part
  out[index] = block[threadIdx.x + threadIdx.y * BLOCK_DIM].x;
}

The shared memory read causes bank conflicts.

Is there a way to avoid them?

Thanks a lot. =)

oYo

  1. In the kernel you give, global memory is going to be the bottleneck, so a few piddly bank conflicts will not slow your performance.

  2. Why do you even need shared memory in this kernel? You aren’t sharing values between threads in a block, so just dump in[index] into a local float2 variable.
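Something like this (just an untested sketch, assuming Complex and Real are plain typedefs for float2 and float, and that the grid exactly covers width x height):

__global__ void complex2real_kernel(float2* in, float* out,
                                    int width, int height)
{
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index  = yIndex * width + xIndex;

  // one 64-bit load per thread, no shared memory needed
  float2 a = in[index];
  out[index] = a.x;
}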

First, thanks for your answer.

  1. I see. I asked the question about this kernel because I have the same kind of problem with another one, a float2 array transpose.

I wanted to take the SDK example and modify it to transpose a float2 array, but I get a lot of bank conflicts. I'm wondering how I could avoid them.

  2. That's what I thought. But when I just use a variable, like this:
float2 a;
a = in[index];
out[index] = a.x;

I get uncoalesced loads and bad performance. Does nvcc "optimize" this away by removing the "useless" variable?

Take a look at slide 50 of http://www.gpgpu.org/sc2007/SC07_CUDA_3_Libraries.pdf.
You need to use a volatile to force the vector read.
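In other words, something along these lines (again just a sketch): marking the local float2 as volatile keeps nvcc from optimizing the variable away and splitting the 64-bit read into two 32-bit loads.

__global__ void complex2real_kernel(float2* in, float* out,
                                    int width, int height)
{
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index  = yIndex * width + xIndex;

  // 'volatile' forces the compiler to keep the vector (float2) read intact
  volatile float2 a = in[index];
  out[index] = a.x;
}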

Thank you, that's exactly what I needed.

I should create a new topic, but…

Is there an easy way to transpose a float2 array while avoiding bank conflicts? :huh:

Thanks again.
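In case it helps anyone who lands here later: one common workaround (a rough sketch based on the SDK transpose pattern, assuming BLOCK_DIM = 16 and that Complex is just float2) is to stage the .x and .y components in two separate shared float tiles, each padded by one column so that reading a column hits different banks.

#define BLOCK_DIM 16

__global__ void transpose_complex(float2* odata, float2* idata,
                                  int width, int height)
{
  // one extra column of padding per row makes column reads conflict free
  __shared__ float tile_x[BLOCK_DIM][BLOCK_DIM + 1];
  __shared__ float tile_y[BLOCK_DIM][BLOCK_DIM + 1];

  int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
  int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
  if (xIndex < width && yIndex < height) {
    float2 v = idata[yIndex * width + xIndex];
    tile_x[threadIdx.y][threadIdx.x] = v.x;
    tile_y[threadIdx.y][threadIdx.x] = v.y;
  }

  __syncthreads();

  // write the transposed tile back with coalesced float2 stores
  xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
  yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
  if (xIndex < height && yIndex < width) {
    odata[yIndex * height + xIndex] =
        make_float2(tile_x[threadIdx.x][threadIdx.y],
                    tile_y[threadIdx.x][threadIdx.y]);
  }
}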