help getting shared memory working

I have a kernel that works using global memory. I’ve attempted to convert it to use shared memory, but it’s not working. I’m hoping someone can help me out and point out what I’m doing wrong.

The following are the portions of the kernel which differ between the two versions.

Using global memory I have:

for(unsigned int k = 0; k != number_of_terms; ++k)
{
    float v = 6.28318530717959*dot(f_coord, d_positions[k]);
    float term = d_weights[k]*fourier_transform_sph(r*d_radii[k]);
    sum.x += term*cos(v);
    sum.y += term*(-sin(v));
}

Using shared memory I have:

int thread_index = blockDim.x*threadIdx.y + threadIdx.x;

__shared__ float3 ds_positions[BLOCK_SIZE];
__shared__ float ds_weights[BLOCK_SIZE];
__shared__ float ds_radii[BLOCK_SIZE];

for(unsigned int k = 0; k < number_of_terms;)
{
    if (k+thread_index < number_of_terms)
    {
        ds_positions[thread_index] = d_positions[k + thread_index];
        ds_weights[thread_index] = d_weights[k + thread_index];
        ds_radii[thread_index] = d_radii[k + thread_index];
    }
    __syncthreads();
    for(int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
    {
        float v = 6.28318530717959*dot(f_coord, ds_positions[j]);
        float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
        sum.x += term*cos(v);
        sum.y += term*(-sin(v));
    }
    __syncthreads();
}

Note: in my code BLOCK_SIZE is the total number of threads per block (e.g., 256 for a 16x16 block). I think that in the SDK examples it may be the square root of this, i.e., the block edge.

The difference between the two when run is that the first version works perfectly, and the second version crashes.

I experimented further with the code to try to understand what’s going on. It turns out that:

  1. If I stage only positions, it kind of works: the results change from frame to frame, which shouldn’t happen.

  2. If I stage only radii, it works perfectly.

  3. If I stage only weights, it flickers at first with something related to the correct results, but eventually goes black.

  4. If I stage both positions and radii, it crashes.

Thanks for any help.

Have you considered packing the 3 shared arrays into 1, so that ds_weights starts right at the end of ds_positions in shared memory? Shared memory is merely a register array, so different ds_ names don’t guarantee that the arrays won’t overlap.

From your observations I have an interesting guess: your 3 ds_ arrays overlap, which would explain all of your errors.

Please refer to Programming Guide 0.8.1 for the packing.

PS: I’ve encountered a case where even packing the 3 ds_ arrays following the guide didn’t work. I then packed the 3 different groups of data into 1 single ds_ array and it worked. Weird, huh?

@acorrigan: the kernel code with the 3 shared arrays looks good. I assume you are aware that all threads in this code

for(int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
{
    float v = 6.28318530717959*dot(f_coord, ds_positions[j]);
    float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
    sum.x += term*cos(v);
    sum.y += term*(-sin(v));
}

compute the same sum. You should vectorize this loop. What are the subroutines dot and fourier_transform_sph doing memory-wise?

@yk_cadcg: Your comment about the shared mem array start addresses is wrong. Shared mem arrays declared with a static size are always allocated properly. You only have to split shared mem pointers yourself for extern-declared shared mem (dynamic size from the kernel call), as that layout obviously cannot be determined at compile time. See manual section 4.2.2.3.
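For illustration, the extern case has to be split by hand, roughly like this (generic names, just a sketch):

__global__ void my_kernel(int n)
{
    // one extern array; its size comes from the third <<< >>> launch parameter
    extern __shared__ float smem[];
    // split it by hand: the compiler cannot lay out dynamically sized arrays
    float* a = smem;        // first n floats
    float* b = smem + n;    // next n floats
    // ... use a[] and b[] as two logical shared arrays ...
}

// host side: my_kernel<<<grid, block, 2*n*sizeof(float)>>>(n);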

Peter

I’m sorry, you are right. I didn’t think carefully.

Thank you both for your help.

Each thread is in fact computing a different value in that loop, since r and f_coord are not constants; they are both variables directly dependent on the thread ID. In this case, do you still suggest vectorizing this loop? Regarding dot and fourier_transform_sph, they aren’t doing much. dot is just the dot product of two float3’s, returned as a float:

__device__ float dot(float3 a, float3 b)
{
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

fourier_transform_sph is a function mapping one float to another; its graph looks qualitatively like a Gaussian.

__device__ float fourier_transform_sph(float r)
{
    if(r >= 0.06)
    {
        float m = pi*r;
        float _2_m = 2.0*m;
        float cos_2_m = cos(_2_m);
        return ((pi*0.75)/(m*m*m*m*m*m))*(cos_2_m - 1.0)*(cos_2_m + m*sin(_2_m) - 1.0);
    }
    return pi + r*(-0.007968913156311 + r*(-18.293608272337678));  // a quadratic interpolant near zero
}

Since what yk_cadcg suggested is apparently not the problem, do you have any idea as to what may be going wrong?

Yes, if r and f_coord depend on threadIdx, that’s different. The two subroutines look unsuspicious (for performance, you might want to use the single-precision math functions and specify the constants as float literals).
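Concretely, I mean something like this for the inner-loop body (a sketch):

    float v = 6.2831853f*dot(f_coord, ds_positions[j]);   // float literal instead of a double constant
    float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
    sum.x += term*cosf(v);                                 // single-precision cosf/sinf instead of cos/sin
    sum.y += term*(-sinf(v));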

I can’t see any technical reason for the described behavior. Semantically, I don’t know what the code is supposed to do, but I would question the necessity of the outer for loop. As k is already incremented in the inner loop, the outer loop itself does nothing. Copy+paste error?

Peter

Thanks for the hint about using single precision (I’ll definitely use that optimization) and for questioning the necessity of the outer loop. I thought it through, and I’m confident that loop is necessary; in case you’re interested in why, I explain below. Before I go into that, I just want to say that from my perspective the problem still stands, so I’d really appreciate any further input from you or anyone else.

Regarding the necessity of the outer loop: each thread computes a summation of number_of_terms terms. Each iteration of the outer loop does a partial sum over BLOCK_SIZE terms. BLOCK_SIZE (usually 16^2) is much smaller than number_of_terms (~1e6), so the inner loop must certainly run more than once. The outer loop variable k keeps track of the number of terms processed so far. In particular, during the last outer iteration, if number_of_terms is not a multiple of BLOCK_SIZE, k will eventually be >= number_of_terms. The if-statement involving k avoids out-of-bounds accesses to the d_* arrays, and k is then used as a stopping condition on the inner loop so it stops before summing garbage terms.
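For example (illustrative numbers): with number_of_terms = 1,000,000 and BLOCK_SIZE = 256, the outer loop runs ceil(1000000/256) = 3907 times, and the last window covers only 1000000 - 3906*256 = 64 valid terms; the if-statement and the k < number_of_terms condition exist precisely for that last, partial window.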

OK, I see. Nice sliding window approach.

So here is one thing to try. Putting the fragment into nvcc and looking at the assembler output, you see that the __shared__ attribute gets dropped (a current bug, described already several times on this forum). Change the thing to:

int thread_index = blockDim.x*threadIdx.y + threadIdx.x;

__shared__ float ds_positions[BLOCK_SIZE*3];
__shared__ float ds_weights[BLOCK_SIZE];
__shared__ float ds_radii[BLOCK_SIZE];

for(unsigned int k = 0; k < number_of_terms;)
{
    if (k+thread_index < number_of_terms)
    {
        ds_positions[3*thread_index  ] = d_positions[k + thread_index].x;
        ds_positions[3*thread_index+1] = d_positions[k + thread_index].y;
        ds_positions[3*thread_index+2] = d_positions[k + thread_index].z;
        ds_weights[thread_index] = d_weights[k + thread_index];
        ds_radii[thread_index] = d_radii[k + thread_index];
    }
    __syncthreads();
    for(int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
    {
        float v = 6.28318530717959*dot(f_coord, make_float3(ds_positions[3*j], ds_positions[3*j+1], ds_positions[3*j+2]));
        float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
        sum.x += term*cos(v);
        sum.y += term*(-sin(v));
    }
    __syncthreads();
}

See if that helps.

Peter

Using that modification the code no longer crashes. However, it still doesn’t work, since the image is just black. Tomorrow I’ll try allocating shared memory using the third parameter inside of <<< >>>, instead of manually sizing the arrays, just in case that works. Meanwhile, any further input is appreciated :)
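Concretely, what I plan to try looks something like this (just a sketch; the kernel signature and call-site details are elided):

    // inside the kernel: one extern array, partitioned by hand
    extern __shared__ float smem[];
    float* ds_positions = smem;                       // BLOCK_SIZE*3 floats
    float* ds_weights   = smem + 3*BLOCK_SIZE;        // BLOCK_SIZE floats
    float* ds_radii     = ds_weights + BLOCK_SIZE;    // BLOCK_SIZE floats

    // at the call site: total size as the third launch parameter
    // kernel<<<grid, block, 5*BLOCK_SIZE*sizeof(float)>>>(...);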

That didn’t work either, even when I staged just one of the arrays.

Since 0.9 came out I gave it another shot, and in case anyone is interested, I can describe what the problem actually turned out to be. In the kernel, before the block of code I posted, I had a conditional return that would return early for one particular column of threads. That messes up loading shared memory: in every block containing that column of threads, a column’s worth of shared memory was never filled from global memory (since the thread that was supposed to do the load had already returned), so garbage was read in during the second loop.
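In other words, the broken pattern versus the fix looks roughly like this (a sketch; some_column stands in for my real condition, and the bounds guard from the original code is omitted for brevity):

    // broken: one column of threads returns before the cooperative load,
    // so its slot in shared memory is never filled (and returning before
    // a __syncthreads() is itself risky)
    if (threadIdx.x == some_column) return;
    ds_weights[thread_index] = d_weights[k + thread_index];
    __syncthreads();   // the surviving threads now read garbage from the missing slots

    // fixed: keep all threads alive through the loads and __syncthreads(),
    // and guard only the per-thread work instead of returning early
    ds_weights[thread_index] = d_weights[k + thread_index];   // all threads load
    __syncthreads();
    if (threadIdx.x != some_column)
    {
        // ... per-thread computation ...
    }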