# Reading the same memory with many threads

Hello, I’m using the kernel below. It works, but it is rather slow. I guess the problem is that all the threads want to read from the same place at the same time when they read the parameter vector? Is there an easy way to solve this?

``````
__global__ void InterpolateVolumeTriLinear(Complex* Volume, float* Parameter_Vector)
{
    int x = threadIdx.x;
    int y = blockIdx.y;
    int z = blockIdx.x;

    int idx = x + y * blockDim.x + z * blockDim.x * gridDim.y;

    float3 Motion_Vector;

    if (x < (DATA_W - PADDING_X) && y < (DATA_H - PADDING_Y) && z < (DATA_D - PADDING_Z))
    {
        // (Motion_Vector.x)   (p0)   (p3  p4  p5)   (x)
        // (Motion_Vector.y) = (p1) + (p6  p7  p8) * (y)
        // (Motion_Vector.z)   (p2)   (p9 p10 p11)   (z)
        Motion_Vector.x = Parameter_Vector[0] + Parameter_Vector[3] * x + Parameter_Vector[4]  * y + Parameter_Vector[5]  * z;
        Motion_Vector.y = Parameter_Vector[1] + Parameter_Vector[6] * x + Parameter_Vector[7]  * y + Parameter_Vector[8]  * z;
        Motion_Vector.z = Parameter_Vector[2] + Parameter_Vector[9] * x + Parameter_Vector[10] * y + Parameter_Vector[11] * z;

        Volume[idx].x = tex3D(tex_Modified_Volume, Motion_Vector.x + 0.5f, Motion_Vector.y + 0.5f, Motion_Vector.z + 0.5f);
        Volume[idx].y = 0;
    }
}
``````

Hi,

A simple solution would be to put Parameter_Vector into constant memory. If it’s always just 12 floats, you could even pass them by value as kernel arguments, which results in them being stored in shared memory. But constant memory would be cleaner.

Another idea:

let the first 12 threads copy Parameter_Vector to shared memory, followed by a __syncthreads().

From then on you access shared memory only. Because all threads of a warp are going
to read the same location, you’ll benefit from the broadcast mechanism of shared memory.

Christian

How do I write that in code?

For the constant memory approach, just declare your vector as a constant (`__constant__ float Parameter_Vector[12];` — note it needs 12 elements, since your kernel reads indices 0 through 11), then use `cudaMemcpyToSymbol` in your host code to initialize the constants before you call your kernel. Since Parameter_Vector is declared globally as a constant in this case, you do not need to pass it to the kernel as a parameter.
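Roughly like this — a standalone toy example, assuming the 12-float layout from your kernel (the kernel body here is simplified to just show the access pattern; the names `c_Parameter_Vector`, `UseConstants`, and `d_Out` are placeholders, not from your code):

``````
#include <cuda_runtime.h>
#include <cstdio>

// 12 coefficients in constant memory; when all threads of a warp read the
// same address, the value is broadcast from the constant cache.
__constant__ float c_Parameter_Vector[12];

// Toy kernel: each thread computes p0 + p3 * x.
__global__ void UseConstants(float* Out)
{
    int x = threadIdx.x;
    Out[x] = c_Parameter_Vector[0] + c_Parameter_Vector[3] * x;
}

int main()
{
    float h_Parameter_Vector[12];
    for (int i = 0; i < 12; i++)
        h_Parameter_Vector[i] = (float)i;

    // Initialize constant memory before the kernel launch.
    cudaMemcpyToSymbol(c_Parameter_Vector, h_Parameter_Vector, 12 * sizeof(float));

    float* d_Out;
    cudaMalloc(&d_Out, 32 * sizeof(float));
    UseConstants<<<1, 32>>>(d_Out);

    float h_Out[32];
    cudaMemcpy(h_Out, d_Out, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_Out[1]);  // p0 + p3 * 1 = 0 + 3 = 3
    cudaFree(d_Out);
    return 0;
}
``````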

For the shared memory approach, just add the following to the beginning of your kernel:

``````
extern __shared__ float Shared_Parameter_Vector[];

// The first 12 threads each copy one element from global memory.
if (threadIdx.x < 12)
{
    Shared_Parameter_Vector[threadIdx.x] = Parameter_Vector[threadIdx.x];
}

// Make sure the copy is complete before any thread reads the values.
__syncthreads();
``````
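One thing to remember with `extern __shared__`: the size of the dynamic shared memory array has to be supplied as the third argument of the launch configuration, otherwise it will be zero bytes. A sketch, with grid/block/pointer names as placeholders for whatever your host code uses:

``````
// Request 12 floats of dynamic shared memory at launch time.
InterpolateVolumeTriLinear<<<dimGrid, dimBlock, 12 * sizeof(float)>>>(d_Volume, d_Parameter_Vector);
``````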