help getting shared memory working

I have a kernel that works using global memory. I’ve attempted to convert it to use shared memory, but it’s not working. I’m hoping someone can help me out and point out what I’m doing wrong.

The following are the portions of the kernel which differ between the two versions.

Using global memory I have:

for(unsigned int k = 0; k != number_of_terms; ++k)
{
    float v = 6.28318530717959*dot(f_coord, d_positions[k]);
    float term = d_weights[k]*fourier_transform_sph(r*d_radii[k]);
    sum.x += term*cos(v);
    sum.y += term*(-sin(v));
}

Using shared memory I have:

int thread_index = blockDim.x*threadIdx.y + threadIdx.x;

__shared__ float3 ds_positions[BLOCK_SIZE];
__shared__ float ds_weights[BLOCK_SIZE];
__shared__ float ds_radii[BLOCK_SIZE];

for(unsigned int k = 0; k < number_of_terms;)
{
    if (k+thread_index < number_of_terms)
    {
        ds_positions[thread_index] = d_positions[k + thread_index];
        ds_weights[thread_index] = d_weights[k + thread_index];
        ds_radii[thread_index] = d_radii[k + thread_index];
    }
    __syncthreads();
    for(int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
    {
        float v = 6.28318530717959*dot(f_coord, ds_positions[j]);
        float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
        sum.x += term*cos(v);
        sum.y += term*(-sin(v));
    }
    __syncthreads();
}

Note: in my code BLOCK_SIZE is the total number of threads per block (e.g., 256 for a 16x16 block). I think that in the SDK examples it may be the square root of this, i.e., the block edge.

The difference between the two when run is that the first version works perfectly, and the second version crashes.

I experimented further with the code to try to understand what’s going on. It turns out that:

  1. If I stage only positions, it kind of works: the results change from frame to frame, which shouldn’t happen.

  2. If I stage only radii, it works perfectly.

  3. If I stage only weights, it flickers at first with something related to the correct results, but eventually goes black.

  4. If I stage both positions and radii, it crashes.

Thanks for any help.

Have you considered packing the 3 shared arrays into 1, so that ds_weights starts right at the end of ds_positions in shared memory? Shared memory is merely a register array, so different ds_ names don’t guarantee that the arrays won’t overlap.

From your observations I have an interesting guess: your 3 ds_ arrays overlap, which would explain all of your errors.

Please refer to Programming Guide 0.8.1 for the packing.

PS: I’ve encountered a case where even packing the 3 ds_ arrays following the guide didn’t work. I then packed the 3 different groups of data into 1 single ds_ array and it worked. Weird, huh?

@acorrigan: the kernel code with the 3 shared arrays looks good. I assume you are aware that all threads in this code

for(int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
{
    float v = 6.28318530717959*dot(f_coord, ds_positions[j]);
    float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
    sum.x += term*cos(v);
    sum.y += term*(-sin(v));
}

compute the same sum. You should vectorize this loop. What are the subroutines dot and fourier_transform_sph doing memory-wise?

@yk_cadcg: Your comment about the shared mem array start addresses is wrong. Shared mem arrays declared with a static size are always allocated properly. You only have to split shared mem pointers yourself for extern-declared shared mem (dynamic size from the kernel call), as that layout obviously cannot be determined at compile time. See manual section 4.2.2.3.
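For illustration, the extern case has to be split by hand, roughly like this (generic names, just a sketch):

__global__ void my_kernel(int n)
{
    // one extern array; its size comes from the third <<< >>> launch parameter
    extern __shared__ float smem[];
    // split it by hand: the compiler cannot lay out dynamically sized arrays
    float* a = smem;        // first n floats
    float* b = smem + n;    // next n floats
    // ... use a[] and b[] as two logical shared arrays ...
}

// host side: my_kernel<<<grid, block, 2*n*sizeof(float)>>>(n);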

Peter

I’m sorry, you are right. I didn’t think carefully.

Thank you both for your help.

Each thread is in fact computing a different value in that loop, since r and f_coord are not constants; they are both variables directly dependent on the thread ID. In this case, do you still suggest vectorizing this loop? Regarding dot and fourier_transform_sph, they aren’t doing much. dot is just the dot product of two float3’s, returned as a float:

__device__ float dot(float3 a, float3 b)
{
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

fourier_transform_sph is a function mapping one float to another; its graph looks qualitatively like a Gaussian.

__device__ float fourier_transform_sph(float r)
{
    if(r >= 0.06)
    {
        float m = pi*r;
        float _2_m = 2.0*m;
        float cos_2_m = cos(_2_m);
        return ((pi*0.75)/(m*m*m*m*m*m))*(cos_2_m - 1.0)*(cos_2_m + m*sin(_2_m) - 1.0);
    }
    return pi + r*(-0.007968913156311 + r*(-18.293608272337678));  // a quadratic interpolant near zero
}

Since what yk_cadcg suggested is apparently not the problem, do you have any idea as to what may be going wrong?

Yes, if r and f_coord depend on threadIdx, that’s different. The two subroutines look unsuspicious (for performance, you might want to use the single-precision math functions and specify the constants as float literals).
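Concretely, I mean something like this for the inner-loop body (a sketch):

    float v = 6.2831853f*dot(f_coord, ds_positions[j]);   // float literal instead of a double constant
    float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
    sum.x += term*cosf(v);                                 // single-precision cosf/sinf instead of cos/sin
    sum.y += term*(-sinf(v));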

I can’t see any technical reason for the described behavior. Semantically, I don’t know what the code is supposed to do, but I would question the necessity of the outer for loop. As k is already incremented in the inner loop, the outer loop itself does nothing. Copy+paste error?

Peter

Thanks for the hint about using single precision (I’ll definitely use that optimization) and for questioning the necessity of the outer loop. I thought it through, and I’m confident that loop is necessary; in case you’re interested in why, I explain below. Before I go into that, I just want to say that from my perspective the problem still stands, so I’d really appreciate any further input from you or anyone else.

Regarding the necessity of the outer loop: each thread computes a summation of number_of_terms terms. Each iteration of the outer loop does a partial sum over BLOCK_SIZE terms. BLOCK_SIZE (usually 16^2) is much smaller than number_of_terms (~1e6), so the inner loop must certainly run more than once. The outer loop variable k keeps track of the number of terms processed so far. In particular, during the last outer iteration, if number_of_terms is not a multiple of BLOCK_SIZE, k will eventually be >= number_of_terms. The if-statement involving k avoids out-of-bounds accesses to the d_* arrays, and k is then used as a stopping condition on the inner loop so it stops before summing garbage terms.
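For example (illustrative numbers): with number_of_terms = 1,000,000 and BLOCK_SIZE = 256, the outer loop runs ceil(1000000/256) = 3907 times, and the last window covers only 1000000 - 3906*256 = 64 valid terms; the if-statement and the k < number_of_terms condition exist precisely for that last, partial window.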

OK, I see. Nice sliding window approach.

So here is one thing to try. Putting the fragment into nvcc and looking at the assembler output, you see that the __shared__ attribute gets dropped (a current bug, described already several times on this forum). Change the thing to:

int thread_index = blockDim.x*threadIdx.y + threadIdx.x;

__shared__ float ds_positions[BLOCK_SIZE*3];
__shared__ float ds_weights[BLOCK_SIZE];
__shared__ float ds_radii[BLOCK_SIZE];

for(unsigned int k = 0; k < number_of_terms;)
{
    if (k+thread_index < number_of_terms)
    {
        ds_positions[3*thread_index  ] = d_positions[k + thread_index].x;
        ds_positions[3*thread_index+1] = d_positions[k + thread_index].y;
        ds_positions[3*thread_index+2] = d_positions[k + thread_index].z;
        ds_weights[thread_index] = d_weights[k + thread_index];
        ds_radii[thread_index] = d_radii[k + thread_index];
    }
    __syncthreads();
    for(int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
    {
        float v = 6.28318530717959*dot(f_coord, make_float3(ds_positions[3*j], ds_positions[3*j+1], ds_positions[3*j+2]));
        float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
        sum.x += term*cos(v);
        sum.y += term*(-sin(v));
    }
    __syncthreads();
}

See if that helps.

Peter

Using that modification the code no longer crashes. However, it still doesn’t work, since the image is just black. Tomorrow I’ll try allocating shared memory using the third parameter inside of <<< >>>, instead of manually sizing the arrays, just in case that works. Meanwhile, any further input is appreciated :)
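Concretely, what I plan to try looks something like this (just a sketch; the kernel signature and call-site details are elided):

    // inside the kernel: one extern array, partitioned by hand
    extern __shared__ float smem[];
    float* ds_positions = smem;                       // BLOCK_SIZE*3 floats
    float* ds_weights   = smem + 3*BLOCK_SIZE;        // BLOCK_SIZE floats
    float* ds_radii     = ds_weights + BLOCK_SIZE;    // BLOCK_SIZE floats

    // at the call site: total size as the third launch parameter
    // kernel<<<grid, block, 5*BLOCK_SIZE*sizeof(float)>>>(...);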

That didn’t work either, even when I staged just one of the arrays.

Since 0.9 came out I gave it another shot, and in case anyone is interested, I can describe what the problem actually turned out to be. In the kernel, before the block of code I posted, I had a conditional return that would return early for one particular column of threads. That messes up loading shared memory: in every block containing that column of threads, a column’s worth of shared memory was never filled from global memory (since the thread that was supposed to do the load had already returned), so garbage was read in during the second loop.
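In other words, the broken pattern versus the fix looks roughly like this (a sketch; some_column stands in for my real condition, and the bounds guard from the original code is omitted for brevity):

    // broken: one column of threads returns before the cooperative load,
    // so its slot in shared memory is never filled (and returning before
    // a __syncthreads() is itself risky)
    if (threadIdx.x == some_column) return;
    ds_weights[thread_index] = d_weights[k + thread_index];
    __syncthreads();   // the surviving threads now read garbage from the missing slots

    // fixed: keep all threads alive through the loads and __syncthreads(),
    // and guard only the per-thread work instead of returning early
    ds_weights[thread_index] = d_weights[k + thread_index];   // all threads load
    __syncthreads();
    if (threadIdx.x != some_column)
    {
        // ... per-thread computation ...
    }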