Reading the same memory with many threads

Hello, I’m using the current kernel, it works but it is rather slow. I guess the problem is that all the threads want to read from the same place at the same time when they try to read the parameter vector? Is there an easy way to solve this?

__global__ void InterpolateVolumeTriLinear(Complex* Volume, float* Parameter_Vector)



	int x = threadIdx.x;

	int y = blockIdx.y;

	int z = blockIdx.x;

	int idx = x + y*blockDim.x + z*blockDim.x*gridDim.y;

	float3 Motion_Vector;

	if (x < (DATA_W - PADDING_X) && y < (DATA_H - PADDING_Y) && z < (DATA_D - PADDING_Z))


		// (motion_vector.x)	(p0)   (p3  p4  p5)	  (x)

		// (motion_vector.y) =  (p1) + (p6  p7  p8)  *   (y)

 		// (motion_vector.z)	(p2)   (p9 p10 p11)	  (z)


		Motion_Vector.x = Parameter_Vector[0] + Parameter_Vector[3] * x + Parameter_Vector[4]   * y + Parameter_Vector[5]   * z;

		Motion_Vector.y = Parameter_Vector[1] + Parameter_Vector[6] * x + Parameter_Vector[7]   * y + Parameter_Vector[8]   * z;

		Motion_Vector.z = Parameter_Vector[2] + Parameter_Vector[9] * x + Parameter_Vector[10]  * y + Parameter_Vector[11]  * z;


		Volume[idx].x = tex3D(tex_Modified_Volume, Motion_Vector.x + 0.5f, Motion_Vector.y + 0.5f, Motion_Vector.z + 0.5f);

		Volume[idx].y = 0;




A simple solution would be to put Parameter_Vector into constant mem. If it’s always just 12 floats, you could even pass it per-value, which will result in it being stored in shared mem. But constant mem would be cleaner.

Another idea:

the first 12 threads copy ParameterVector to shared memory, followed by a __syncthreads()

From then on you access shared memory only. Because all threads of a warp are going
to read the same location, you’ll benefit from the broadcast mechanism of shared memory.


How do I write that in code?

How do I write that in code?

For the constant approach, just declare your constant [font=“Courier New”](“constant float Parameter_Vector[10][/font]”), then use [font=“Courier New”]cudaMemcpyToSymbol[/font] in your host code to initialize the constants before you call your kernel. Since Parameter_Vector is declared globally as a constant in this case, you do not need to pass it in to the kernel as a parameter.

For the shared memory approach, just add the following to the beginning of your kernel:

extern __shared__ float Shared_Parameter_Vector[];

if (threadIdx.x < 10)

	Shared_Parameter_Vector[threadIdx.x] = Parameter_Vector[threadIdx.x];


Then use Shared_Parameter_Vector in place of Parameter_Vector in the remainder of your kernel. Just remember to allocate space for the shared memory in your kernel launch.

If I declare it as constant, does it only mean that the memory is constant or that the variable is constant, i.e. can I change the values in Pararameter_Vector ?

How do I allocate space for the shared memory in the kernel launch?

Neither. It means that it lives in the constant memory region of the GPU. Read on on the memory hierarchy in the programming guide for more details. Most pertinent to your case is that constant memory is the absolute fastest memory on the GPU for broadcast reading (all threads in a warp read the same element of the array).

The third argument in the kernel parameters kernel<<<grid,threads,smem>>>(). It’s all in the programming guide.