How to avoid write conflicts when writing to the same location in global memory?

I am a beginner in CUDA.

While modifying my finite element program to be CUDA-enabled, I am facing a problem.

When computing the internal force on each node, I loop over each element and compute that element's contribution to each of its nodes.

Typically, an 8-node brick element generates 8 nodal force contributions, one per corner node. To compute the total force on a node, all of the contributions to that node must be summed together.

loop over each element

    // compute the nodal force contribution from the current element
    FORCE_ELEM[1] = ...
    FORCE_ELEM[2] = ...
    FORCE_ELEM[3] = ...
    ...
    FORCE_ELEM[8] = ...

    // get the node number of each corner of this element
    N1 = NC[1][ELEM]
    N2 = NC[2][ELEM]
    N3 = NC[3][ELEM]
    ...
    N8 = NC[8][ELEM]

    // sum them into the nodal force array
    FORCE[N1] += FORCE_ELEM[1]
    FORCE[N2] += FORCE_ELEM[2]
    FORCE[N3] += FORCE_ELEM[3]
    ...
    FORCE[N8] += FORCE_ELEM[8]

end loop

FORCE is the nodal force array in GPU global memory.

When this code is put into a CUDA computation kernel, the summation of the nodal forces obviously runs into write conflicts.

I have read the histogram sample program, which has the same problem.

But in that sample the result array is very small (64 single-byte bins or 256 two-byte bins), so it can easily be stored in shared memory.

In my code, however, the FORCE array is very large, usually millions of double-precision floats, so the method from the histogram sample obviously cannot be used directly.

Also, for each element the result array FORCE is touched very sparsely: only 8 nodal forces need to be accumulated.

So the method used in the histogram sample is not suitable for my program. I do not think it is a good idea to store a copy of the very large FORCE array per thread, or even per warp.

I could use extra global memory to store the per-element nodal forces for each thread or warp, but I do not want to spend that much extra memory on them.

Can anyone give me some suggestions to solve this problem?

The following is demo code similar to my real program.

[codebox]
#include <stdio.h>

#define BLOCK_NUM  2
#define THREAD_NUM 1
#define NUM_ELE    4
#define NUM_NOD    18

// 1-based accessor macros for the connectivity and nodal force arrays
#define NC(j,N)  *(NC+(N-1)*8+j-1)
#define FORCE(N) *(FORCE+N-1)

__global__ void internal_kernel(int *, float *);

int main(void)
{
    int   NC[8*NUM_ELE];
    float FORCE[NUM_NOD];

    // element connectivity: 8 node numbers per element
    NC(1,1)=1;  NC(2,1)=3;  NC(3,1)=9;   NC(4,1)=7;
    NC(5,1)=2;  NC(6,1)=4;  NC(7,1)=10;  NC(8,1)=8;

    NC(1,2)=3;  NC(2,2)=5;  NC(3,2)=11;  NC(4,2)=9;
    NC(5,2)=4;  NC(6,2)=6;  NC(7,2)=12;  NC(8,2)=10;

    NC(1,3)=7;  NC(2,3)=9;  NC(3,3)=15;  NC(4,3)=13;
    NC(5,3)=8;  NC(6,3)=10; NC(7,3)=16;  NC(8,3)=14;

    NC(1,4)=9;  NC(2,4)=11; NC(3,4)=17;  NC(4,4)=15;
    NC(5,4)=10; NC(6,4)=12; NC(7,4)=18;  NC(8,4)=16;

    int   *dNC;
    float *dFORCE;

    cudaMalloc((void**)&dNC,    NUM_ELE*8*sizeof(int));
    cudaMalloc((void**)&dFORCE, NUM_NOD*sizeof(float));

    cudaMemcpy(dNC, NC, NUM_ELE*8*sizeof(int), cudaMemcpyHostToDevice);

    // call the computation kernel for the first time:
    // initialize the force array first, then compute the internal force
    cudaMemset(dFORCE, 0, NUM_NOD*sizeof(float));

    internal_kernel<<<BLOCK_NUM,THREAD_NUM>>>(dNC, dFORCE);

    cudaMemcpy(FORCE, dFORCE, NUM_NOD*sizeof(float), cudaMemcpyDeviceToHost);

    for(int i=1; i<=NUM_NOD; i++)
        printf("NOD %d %f\n", i, FORCE(i));

    cudaFree(dNC);
    cudaFree(dFORCE);

    return 0;
}

__global__ void internal_kernel(int *NC, float *FORCE)
{
    const int tid     = threadIdx.x;
    const int bid     = blockIdx.x;
    const int THREADS = blockDim.x;
    const int BLOCKS  = gridDim.x;

    float FORCE_ELE[8];

    // compute the internal force, one element per thread (grid-stride loop)
    for(int NELE = bid*THREADS+tid+1; NELE <= NUM_ELE; NELE += BLOCKS*THREADS)
    {
        // dummy per-element nodal force contributions
        for(int j=1; j<=8; j++)
            FORCE_ELE[j-1] = NELE*10.0f + j*1.0f;

        for(int j=1; j<=8; j++)
        {
            int NOD = NC(j,NELE);
            // write conflict: several threads may update the same node here
            FORCE(NOD) = FORCE(NOD) + FORCE_ELE[j-1];
            // printf("NELE: %1d J:%1d NOD:%2d \t FORCE_ELE:%10.1f\t FORCE:%10.1f\n",
            //        NELE, j, NOD, FORCE_ELE[j-1], FORCE(NOD));
        }
    }

    return;
}
[/codebox]


Check out the atomic instructions in section 3 of the CUDA Reference Manual. For int variables you can just use atomicAdd directly. For float variables you’ll need to implement some sort of lock using e.g. atomicCAS.
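
For an integer accumulator, a minimal sketch (assuming FORCE held int values, your card supports global atomics, and reusing the names from your demo kernel) would be:

[codebox]
// minimal sketch, assuming FORCE is an int* array and FORCE_ELE holds int
// contributions (names taken from the demo kernel above)
for(int j=1; j<=8; j++)
{
    int NOD = NC(j,NELE);
    // atomicAdd serializes conflicting updates, so no contribution is lost
    atomicAdd(&FORCE(NOD), FORCE_ELE[j-1]);
}
[/codebox]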

Thank you for your suggestion.

But I have also noticed that a lock built from atomic operations may have a large impact on performance.

For example, the following code from other people implements the locking operation:

[codebox]
__device__ inline void atomicFloatAdd(float *address, float val)
{
    int i_val = __float_as_int(val);
    int tmp0  = 0;
    int tmp1;

    // keep retrying the compare-and-swap until this thread's addition sticks
    while( (tmp1 = atomicCAS((int *)address, tmp0, i_val)) != tmp0 )
    {
        tmp0  = tmp1;
        i_val = __float_as_int(val + __int_as_float(tmp1));
    }
}


[/codebox]
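
If I used this helper, the accumulation in my kernel would presumably become something like the sketch below (names taken from my demo code above):

[codebox]
// sketch only: replace the plain read-modify-write in internal_kernel
int NOD = NC(j,NELE);
atomicFloatAdd(&FORCE(NOD), FORCE_ELE[j-1]);
[/codebox]

But when many elements share the same node, all of those threads keep retrying on the same address, and that is the performance effect I am worried about.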

I have read a paper, "Efficient nonlinear FEM for soft tissue modelling and its GPU implementation within the open source framework SOFA", from http://www.sofa-framework.org/docs/sofa-tled-isbms08.pdf

On page 6, Sec. 3.4, "Kernel organisation", it is suggested that the problem can be solved by using two kernels:

"The first kernel operates over elements in the model and computes the element stresses. It then converts these into nodal force contributions, which are written to global memory. The second kernel operates over nodes and reads the previously calculated element force contributions and sums them for each node."

But I cannot understand how to do it in detail.

Can anyone help me, or write a few lines of demo code?

Thank you in advance

Perhaps you could best reverse your algorithm. Instead of computing the forces generated by each element and then scattering them to those neighbors, do the opposite. Visit each element and GATHER the forces you need from the neighbors. Then there is no multithread summation and you don’t need any locks or atomics at all.

The drawback of this strategy is that you have to read from a voxel multiple times, wasting bandwidth and maybe computation.

This is the basic strategy of particle force exchange in molecular dynamics tools like HOOMD.

Thank you for your comment.

You mean I should search all of the elements when summing up the force on each node?

But the element connectivity matrix NC is very sparse, so I think that search process would waste too much time.

Or am I wrong?

The following is the code:

[codebox]
__global__ void internal_kernel(int *NC, float *FORCE_ELE)
{
    const int tid     = threadIdx.x;
    const int bid     = blockIdx.x;
    const int THREADS = blockDim.x;
    const int BLOCKS  = gridDim.x;

    // compute the internal force contributions and store them per element
    for(int NELE = bid*THREADS+tid+1; NELE <= NUM_ELE; NELE += BLOCKS*THREADS)
    {
        for(int j=1; j<=8; j++)
            *(FORCE_ELE+(NELE-1)*8+j-1) = NELE*10.0f + j*1.0f;
    }

    return;
}

__global__ void gather_kernel(int *NC, float *FORCE_ELE, float *FORCE)
{
    const int tid     = threadIdx.x;
    const int bid     = blockIdx.x;
    const int THREADS = blockDim.x;
    const int BLOCKS  = gridDim.x;

    // each thread owns some nodes and scans every element for contributions
    for(int NOD = bid*THREADS+tid+1; NOD <= NUM_NOD; NOD += BLOCKS*THREADS)
    {
        for(int NELE=1; NELE<=NUM_ELE; NELE++)
        {
            for(int j=1; j<=8; j++)
            {
                int N = NC(j,NELE);
                if(N==NOD) FORCE(N) = FORCE(N) + *(FORCE_ELE+(NELE-1)*8+j-1);
            }
        }
    }

    return;
}
[/codebox]

You mean the finite element connections are not a regular pattern like voxels that you can just index into?
If the connections are sparse and unpredictable, your problem is even more similar to the force-exchange problems of particle simulation than I thought.

Mr. Anderson here on the forums (who will likely be reading this 30 seconds after I post since he seems to live here too…) is the expert at this. His very effective solution, which you can read about on the HOOMD page (and paper), is to precompute a neighbor list for each particle. Then for simulation, you just run down the neighbor list, accumulating forces. Using texture memory helps speed by minimizing bandwidth use.

In HOOMD the neighbor lists themselves are dynamic. If your finite element connections are static, this will work even better for you.
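
For your mesh, that gather might look roughly like the sketch below. NBR_COUNT, NBR_ELEM, NBR_CORNER, nNodes and maxNbr are hypothetical precomputed arrays and parameters (how many elements touch each node, which elements they are, and which corner of each element the node is), not anything from your posted code:

[codebox]
// Sketch only: gather per node from a hypothetical precomputed adjacency list.
// NBR_COUNT[n]             = number of elements touching node n
// NBR_ELEM  [n*maxNbr + k] = k-th element touching node n          (0-based)
// NBR_CORNER[n*maxNbr + k] = which of that element's 8 corners node n is (0-based)
__global__ void gather_by_neighbor_list(const int *NBR_COUNT,
                                        const int *NBR_ELEM,
                                        const int *NBR_CORNER,
                                        const float *FORCE_ELE,
                                        float *FORCE,
                                        int nNodes, int maxNbr)
{
    int n = blockIdx.x*blockDim.x + threadIdx.x;   // 0-based node index
    if(n >= nNodes) return;

    float f = 0.0f;
    for(int k=0; k<NBR_COUNT[n]; k++)
    {
        int e = NBR_ELEM  [n*maxNbr + k];
        int c = NBR_CORNER[n*maxNbr + k];
        f += FORCE_ELE[e*8 + c];   // per-element forces written by the first kernel
    }
    FORCE[n] = f;                  // each node is written by exactly one thread
}
[/codebox]

Since exactly one thread writes each node, there are no conflicts and no atomics, and the brute-force scan over all elements disappears.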

Thank you for your help.

I will first read the paper you mentioned.

My finite element connections are static, so it sounds like the method you suggested should work even better for me.

Thank you very much.

I have skimmed the paper.

Although the connectivity between elements in my program is fixed, the computation of the force is somewhat different from a molecular dynamics simulation.

In MD, it seems that the force on a node depends only on the two nodes of a neighbouring edge (sorry, I know little about MD). But in finite elements, the force contribution must be computed from the whole element state: for a brick element, the positions and displacements of its 8 nodes are used to compute the strain and stress distribution, and from that the forces on the nodes.

So I cannot compute the force just by scanning each node and each edge; I must first compute the forces by scanning all of the elements.

I want to try the following method.

First, store the nodal force contribution from each element in global memory, and then sum the contributions together.

Two extra arrays are used:

FORCE_COMP[NODE][4]   // the force contribution array; the second dimension is 4 in my case, since one node can be shared by at most 4 elements

COUNT_COMP[NODE]      // the count of contributions deposited for each node

[codebox]
__global__ void internal_force_kernel(int NUM_ELE, int *NC, int *COUNT_COMP, float *FORCE_COMP)
{
    const int tid     = threadIdx.x;
    const int bid     = blockIdx.x;
    const int THREADS = blockDim.x;
    const int BLOCKS  = gridDim.x;

    int   NC_E[8];
    float FORCE_ELE[8];

    // COUNT_COMP is assumed to be zeroed (cudaMemset) before this kernel runs
    for(int NE = bid*THREADS+tid+1; NE <= NUM_ELE; NE += THREADS*BLOCKS)
    {
        // get the node number of each corner of this element
        for(int i=0; i<8; i++)
            NC_E[i] = NC[(NE-1)*8 + i];

        // compute the internal nodal force contributions from this element
        FORCE_ELE[0] = ...
        FORCE_ELE[1] = ...
        ....
        FORCE_ELE[7] = ...

        // store them into global memory
        for(int i=0; i<8; i++)
        {
            // get the node number (1-based, as in my earlier code)
            int NOD = NC_E[i];
            // reserve a unique slot (mailbox) for this contribution
            int ipos = atomicAdd(&COUNT_COMP[NOD-1], 1);
            // store it into the temporary force array in global memory
            FORCE_COMP[(NOD-1)*4 + ipos] = FORCE_ELE[i];
        }
    }

    return;
}

__global__ void gather_force_kernel(int NUM_NOD, int *COUNT_COMP, float *FORCE_COMP, float *FORCE)
{
    const int tid     = threadIdx.x;
    const int bid     = blockIdx.x;
    const int THREADS = blockDim.x;
    const int BLOCKS  = gridDim.x;

    for(int NOD = bid*THREADS+tid+1; NOD <= NUM_NOD; NOD += THREADS*BLOCKS)
    {
        FORCE[NOD-1] = 0.0f;
        // sum all of the contributions deposited for this node
        for(int i=0; i<COUNT_COMP[NOD-1]; i++)
            FORCE[NOD-1] += FORCE_COMP[(NOD-1)*4 + i];
    }

    return;
}
[/codebox]

Thus, the write conflict can be avoided.

But the above program has some disadvantages: it uses scattered writes to global memory, which hurts efficiency (compared with the locking method, which one is better?).

It also needs a large amount of extra storage for the temporary forces, and it reads and writes global memory twice.

Is this approach suitable for GPU computation, and how can the performance be improved further?

Thank you very much for any suggestions.

Yes, I think your strategy will work fine. It does indeed take two passes, but they are very easy passes.

Your strategy is likely correct. Kernel 1 has each element “deposit” its force contributions into each node's own unique mailbox.
The second kernel just loops through its own mailboxes and sums them.
No write conflicts, and no atomics needed.

However, now you have the implementation optimization to think about… efficient memory access.
If you just have some lists in device memory, accessing single elements at a time wastes a lot of your memory bandwidth, since the smallest device memory transaction size is 16 words, and you're only writing one word.

You may think this is nit-picking (worrying about small details) but it is extremely likely your final speed will be limited by memory bandwidth, so optimizing memory bandwidth can give a 2-10X speedup over simpler but inefficient access.

This is where CUDA becomes tricky… not getting an algorithm working, but rearranging it to fit the hardware’s memory behavior.

Are your element connections truly unpredictable, or are they in some pattern like a banded matrix? If they have structure, it may be efficient to write out your data in large chunks, then reshuffle the order later by using shared memory, but the actual plan for that kind of access would depend a lot on your data structure.
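
As one concrete example of that kind of rearrangement (just a sketch, reusing the FORCE_COMP / COUNT_COMP names from your post but with a transposed, slot-major layout; nNodes and maxShare are assumed parameters): if the mailbox array is stored as FORCE_COMP[slot*nNodes + node] instead of node-major, consecutive threads in the gather kernel read consecutive addresses.

[codebox]
// Sketch: gather with a slot-major mailbox layout, FORCE_COMP[slot*nNodes + n].
// Thread n handles node n, so at a fixed slot the reads of a half-warp are
// contiguous in memory and can coalesce into a single transaction.
__global__ void gather_coalesced(const int *COUNT_COMP, const float *FORCE_COMP,
                                 float *FORCE, int nNodes, int maxShare)
{
    int n = blockIdx.x*blockDim.x + threadIdx.x;   // 0-based node index
    if(n >= nNodes) return;

    float f = 0.0f;
    int count = COUNT_COMP[n];
    for(int slot=0; slot<maxShare; slot++)         // fixed trip count keeps accesses aligned
        if(slot < count)
            f += FORCE_COMP[slot*nNodes + n];
    FORCE[n] = f;
}
[/codebox]

The deposit kernel's writes are then the scattered ones, but the reads in the gather become fully coalesced.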

The cool thing is that FEM can be cast into a formulation where it just provides the forces for a MD simulation :) I’ll be working on getting this into HOOMD this summer, so I’ve thought about it a little bit.

Let me just state in simple terms here what the biggest difference is: In “normal” MD, a gather means I only need to compute each force twice and it is calculated between particles only. In FEM, the force is calculated from the entire element (usually tetrahedra, so 4 points) and each node may be connected to 12 (or more) other elements. In this case, the “gather” approach would mean loading in 4 particles x12 times and calculating each force 12 times!! Maybe there are enough FLOPS to waste for the latter, but the memory bandwidth wasted would be a killer.

You guys are already talking about a “mailbox” solution, which is indeed the same route I’ve thought about. With the static FEM mesh, you can even pre-calculate all the mailbox destinations and apply some kind of sort to get the most out of the G200 memory coalescing that you can.
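
Pre-calculating the destinations could be as simple as a one-time host pass like the sketch below; DEST and COUNT are hypothetical arrays (a fixed mailbox slot for every element corner, and the per-node contribution counts), not part of anyone's posted code:

[codebox]
// Host-side sketch: for a static mesh, assign each (element, corner) pair a
// fixed mailbox slot once, so the deposit kernel can write without any atomics.
// NC holds 8 node numbers per element (1-based, as in the code above).
// DEST gets one slot index per (element, corner); COUNT ends up holding the
// number of contributions per node and can be reused by the gather kernel.
void build_mailbox_destinations(const int *NC, int numEle,
                                int *DEST  /* size 8*numEle */,
                                int *COUNT /* size numNodes, zero-initialized */)
{
    for(int e = 0; e < numEle; e++)
        for(int j = 0; j < 8; j++)
        {
            int nod = NC[e*8 + j] - 1;      // 0-based node index
            DEST[e*8 + j] = COUNT[nod]++;   // next free slot in this node's mailbox
        }
}
[/codebox]

The deposit kernel then just writes each contribution to the slot given by DEST, and you can permute or sort DEST offline to improve coalescing.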

I’ve got some “crazy” ideas to try and do mailboxes in shared memory within a block that has some (or all) of the connected elements to reduce global memory bandwidth needs. But I haven’t fully thought that one out yet. This would also have to take advantage of some kind of pre-computed sort to improve the locality of elements on blocks.

Sorry I don’t have any better suggestions. I haven’t had time to really think about it in detail yet. It is actually a very challenging problem to cast into CUDA.

Mr. Anderson, you’re slipping. It took you 6 hours to dive into the thread. Sleeping is no excuse!

mybiandu, you’re new to CUDA, and may be a little overwhelmed by this sudden discussion of coalesced memory, mailboxes, shared memory shuffling, and multiple kernels. That’s all getting deep into the details of GPU coding. But for your first kernels, you might try those straightforward loops just like you planned. GPUs are fast enough that even that code may give quite decent speedups over the CPU (2X? 5X? 10X?). But then when you dive deeper later and optimize memory use, you may jump up to 20X, 50X, or 100X speeds.

Mr. Anderson and SPWorley,
Thank you very much for your good advice. You are so kind.

I will first get the program running correctly, and then improve it as best I can.

Thank you all.