I have a kernel that works using global memory. I've attempted to convert it to use shared memory, but it isn't working. I'm hoping someone can point out what I'm doing wrong.
The following are the portions of the kernel that differ between the two versions.
Using global memory I have:
for (unsigned int k = 0; k != number_of_terms; ++k)
{
    // Accumulate one Fourier term per iteration, reading directly from global memory.
    float v = 6.28318530717959*dot(f_coord, d_positions[k]);  // 2*pi * (f . x_k)
    float term = d_weights[k]*fourier_transform_sph(r*d_radii[k]);
    sum.x += term*cos(v);
    sum.y += term*sin(v);
}
Using shared memory I have:
int thread_index = blockDim.x*threadIdx.y + threadIdx.x;

__shared__ float3 ds_positions[BLOCK_SIZE];
__shared__ float  ds_weights[BLOCK_SIZE];
__shared__ float  ds_radii[BLOCK_SIZE];

for (unsigned int k = 0; k < number_of_terms;)
{
    // Each thread stages one element of the current tile into shared memory,
    // guarding against reading past the end of the arrays.
    if (k + thread_index < number_of_terms)
    {
        ds_positions[thread_index] = d_positions[k + thread_index];
        ds_weights[thread_index]   = d_weights[k + thread_index];
        ds_radii[thread_index]     = d_radii[k + thread_index];
    }
    __syncthreads();

    // Every thread then consumes the whole tile from shared memory.
    for (int j = 0; j != BLOCK_SIZE && k < number_of_terms; ++j, ++k)
    {
        float v = 6.28318530717959*dot(f_coord, ds_positions[j]);
        float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]);
        sum.x += term*cos(v);
        sum.y += term*sin(v);
    }
    __syncthreads();
}
Note: in my code, BLOCK_SIZE is the total number of threads per block. I believe that in the SDK examples it may instead be the square root of this (the block's side length).
When run, the first version works perfectly and the second version crashes.
I experimented further with the code to try to understand what's going on. It turns out that:

- If I stage only the positions, it sort of works, but the results change from frame to frame, which shouldn't happen.
- If I stage only the radii, it works perfectly.
- If I stage only the weights, it flickers at first with something resembling the correct results, but eventually goes black.
- If I stage both positions and radii, it crashes.
Thanks for any help.