Array calculation in CUDA

In my program I need to calculate gravity for every created particle against every other particle.

However, I am not able to use more than 256 particles or the program hangs.

I think it has something to do with blocks in the kernel, because I have to read the data from the array for every particle in every thread.


void updateUniverse(uint vboOldPos, uint vboNewPos, uint oldVel, uint newVel, int items)
{
    float4 *oldPos, *newPos;
    CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&oldPos, vboOldPos));
    CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&newPos, vboNewPos));

    float3 *vDOld;
    float3 *vDNew;
    CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&vDOld, oldVel));
    CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&vDNew, newVel));

    int numThreads = min(512, items);
    int numBlocks = (int) ceil(items / (float) numThreads);

    updateUniverseD<<< numBlocks, numThreads, sizeof(float) * items * 3 >>>(oldPos, newPos, vDOld, vDNew, items);

    // check if kernel invocation generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    CUDA_SAFE_CALL(cudaGLUnmapBufferObject((uint) vboOldPos));
    CUDA_SAFE_CALL(cudaGLUnmapBufferObject((uint) vboNewPos));
    CUDA_SAFE_CALL(cudaGLUnmapBufferObject((uint) oldVel));
    CUDA_SAFE_CALL(cudaGLUnmapBufferObject((uint) newVel));
}


Part from the device code:


__device__ void recalcMagD(float4 *pos, float4 *posnew, float3 *posVbo, int i, int items)
{
    int ii = 0;
    double v1x, v2x, v1z, v2z, vx, vz, vmass, vangle, vmag, ox, oz, omass;
    double bot_distance;
    double grav, gangle;

    vx = pos[i].x;
    vz = pos[i].z;
    vmass = posVbo[i].x;
    vmag = posVbo[i].y;
    vangle = posVbo[i].z;

    for (ii = 0; ii < items; ii++)
    {
        if (i != ii)
        {
            ox = pos[ii].x;
            oz = pos[ii].z;
            omass = posVbo[ii].x;

            bot_distance = sqrtf(powf(ox - vx, 2) + powf(oz - vz, 2));
            grav = ((GRAV * vmass * omass) / powf(bot_distance, 2)) / vmass;
            gangle = atan2f(oz - vz, ox - vx);

            v1x = vx + (vmag * cosf(vangle));
            v1z = vz + (vmag * sinf(vangle));

            v2x = vx + (grav * cosf(gangle));
            v2z = vz + (grav * sinf(gangle));

            vx = vx - (v1x + v2x) / 2.0;
            vz = vz - (v1z + v2z) / 2.0;

            vmag = sqrtf(powf(vx, 2) + powf(vz, 2)) * 2.0;
            vangle = atan2f(vz, vx);
        }
    }

    posVbo[i] = make_float3(vmass, vmag, vangle);
    posnew[i] = make_float4(pos[i].x + (vmag * cosf(vangle)), 0.0, pos[i].z + (vmag * sinf(vangle)), 0.0);
}




__global__ void
updateUniverseD(float4 *posVboOld, float4 *posVboNew, float3 *vo, float3 *vn, int items)
{
    int ii = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

    extern __shared__ float3 shared[];
    shared[ii] = vo[ii];
    __syncthreads();

    recalcMagD(posVboOld, posVboNew, shared, ii, items);

    vn[ii] = shared[ii];
    __syncthreads();
}



I’m pretty confused, and any help is highly appreciated.

Maybe: instead of

    int numBlocks = (int) ceil(items / (float) numThreads);

try

    int numBlocks = (int) ceil((float) items / numThreads);

Also, there’s no need for the __syncthreads() at the end of your kernel.

Do you get any warnings during compilation?

I made the suggested changes, but it didn’t help.

It seems that if there are more than a certain number of particles, I get a segmentation fault on the last string of the first block in debug emulation mode.

And I do not get any warnings during compilation.

To clear up:

I have 2 input arrays that every string in every block accesses in a for loop.
My question is: does that cause bank interference? And if so, how would I fix it?

What is a “string”? Are you calling a thread a “string”?


I see that your usage of shared memory is NOT optimal.

You are allocating a big array (“items * sizeof(float) * 3”) for shared memory, but each block uses only a portion of it (one element per thread). The remaining memory is just wasted, which is really bad. What’s more, in your recalc…() function you are accessing the entire shared array: the FOR loop runs from 0 to “items”. Since each block populates only its own part of the shared array, you might be reading garbage values there, so your computation won’t yield correct results.

Also, I don’t understand what you mean by “256” particles. Does this refer to “items”? In the case of 256 items you will be spawning only one block, and then the problem I was discussing in the above paragraph will NOT show up. But it would be better to write correct code.

Also, how about replacing “items * sizeof(float) * 3” with “items * sizeof(float3)”?

This is how I do an N^2 particle-particle computation with efficient use of shared memory. I believe the N-body demo in the SDK does something similar, and it is also documented in the latest GPU Gems book.

The code below is meant as pseudocode and doesn’t perform boundary checks before reading from or writing to the arrays. Each block loads particles through a sliding window, and then each particle’s force is computed against all the others using the shared-memory broadcast method.

__global__ void calcforces(int N, float4 *pos, float4 *force)
{
    // each thread calculates the net force on one single particle
    int pidx = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ float3 pos_shared[];  // launched with blockDim.x * sizeof(float3)

    float4 net = make_float4(0, 0, 0, 0);

    // sliding window: each iteration, every thread participates in a coalesced read
    for (int i = 0; i < N; i += blockDim.x)
    {
        float4 p = pos[i + threadIdx.x];
        pos_shared[threadIdx.x].x = p.x;
        pos_shared[threadIdx.x].y = p.y;
        pos_shared[threadIdx.x].z = p.z;
        __syncthreads();

        for (int j = 0; j < blockDim.x; j++)
        {
            float3 pj = pos_shared[j];
            // calculate force between pos[pidx] and pj, sum into net
        }
        __syncthreads();
    }

    force[pidx] = net;
}


Well, that’s a bit longer than I thought it would be. Hopefully you get the picture. Let me know if you have any questions.

I get the concept, but I keep getting a segmentation fault on the last thread in emudebug mode… I can’t figure out why.

Well, if the number of particles is not a multiple of the block size, then the last threads of the last block are going to write past the end of the array = seg fault.
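One common way to make the last block safe is an explicit bounds check in the kernel; a minimal sketch against the kernel above (the guard placement is the point, not the body):

```
__global__ void
updateUniverseD(float4 *posVboOld, float4 *posVboNew, float3 *vo, float3 *vn, int items)
{
    int ii = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

    // threads past the end of the arrays exist whenever items is not a
    // multiple of blockDim.x; they must not read or write the arrays
    if (ii < items)
    {
        vn[ii] = vo[ii];  // ...real per-particle work goes here...
    }
}
```

Note that if the kernel contains __syncthreads(), every thread of the block must reach it, so guard the individual memory accesses rather than returning early from the out-of-range threads.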

It does not matter whether it is a multiple or not; at the moment I’m trying it so there is only one block, and I set boundaries so it is not out of bounds.

Using the debugger, I get a segmentation fault on the last thread (thread 65 when there are 64 items).
It comes up right after the second __syncthreads();