I am a beginner in CUDA. While porting my finite element program to CUDA, I have run into a problem.
When computing the internal force on each node, I loop over the elements and compute each element's contribution to its nodes. An 8-node brick element, for example, generates one force contribution at each of its 8 nodes. To obtain the total force on a node, all contributions to that node must be summed together.
loop over each element ELEM
//compute the nodal force contributions from the current element
FORCE_ELEM[1]=...
FORCE_ELEM[2]=...
FORCE_ELEM[3]=...
....
FORCE_ELEM[8]=...
//get the node number of each corner of this element
N1=NC[1][ELEM]
N2=NC[2][ELEM]
N3=NC[3][ELEM]
...
N8=NC[8][ELEM]
//accumulate them into the nodal force array
FORCE[N1]+=FORCE_ELEM[1]
FORCE[N2]+=FORCE_ELEM[2]
FORCE[N3]+=FORCE_ELEM[3]
...
FORCE[N8]+=FORCE_ELEM[8]
end loop
FORCE is the nodal force array in GPU global memory.
When I put the above code into a CUDA kernel, the accumulation into FORCE obviously runs into write conflicts: two elements that share a node may update the same entry at the same time.
I have read the histogram sample program, which has the same problem. In that sample, however, the result array is very small (64 single-byte or 256 double-byte bins), so it fits easily into shared memory. In my code the FORCE array is very large, usually millions of double-precision floats, so the histogram method cannot be used directly. Also, for each element the result array FORCE is touched very sparsely: only 8 nodal forces need to be accumulated. Keeping a private copy of such a large FORCE array per thread, or even per warp, does not seem reasonable, so the method from the histogram sample is not suitable for my program.
I could use extra global memory to store the per-element nodal forces for each thread or warp, but I do not want to spend that much extra memory on them.
Can anyone give me some suggestions for solving this problem?
The following demo code is similar to my real program.
[codebox]
#include <stdio.h>
#define BLOCK_NUM 2
#define THREAD_NUM 1
#define NUM_ELE 4
#define NUM_NOD 18
#define NC(j,N) *(NC+(N-1)*8+j-1)
#define FORCE(N) *(FORCE+N-1)
__global__ void internal_kernel(int *, float *);
int main(void)
{
int NC[8*NUM_ELE];
float FORCE[NUM_NOD];
NC(1,1)=1 ;
NC(2,1)=3 ;
NC(3,1)=9 ;
NC(4,1)=7 ;
NC(5,1)=2 ;
NC(6,1)=4 ;
NC(7,1)=10 ;
NC(8,1)=8 ;
NC(1,2)=3 ;
NC(2,2)=5 ;
NC(3,2)=11 ;
NC(4,2)=9 ;
NC(5,2)=4 ;
NC(6,2)=6 ;
NC(7,2)=12 ;
NC(8,2)=10 ;
NC(1,3)=7 ;
NC(2,3)=9 ;
NC(3,3)=15 ;
NC(4,3)=13 ;
NC(5,3)=8 ;
NC(6,3)=10 ;
NC(7,3)=16 ;
NC(8,3)=14 ;
NC(1,4)=9 ;
NC(2,4)=11 ;
NC(3,4)=17 ;
NC(4,4)=15 ;
NC(5,4)=10 ;
NC(6,4)=12 ;
NC(7,4)=18 ;
NC(8,4)=16 ;
int *dNC;
float *dFORCE;
cudaMalloc( (void**)&dNC,8*NUM_ELE*sizeof(int));
cudaMalloc( (void**)&dFORCE,NUM_NOD*sizeof(float));
cudaMemcpy(dNC,NC,8*NUM_ELE*sizeof(int),cudaMemcpyHostToDevice);
// call compuation kernel first time
// initialize the force first
cudaMemset(dFORCE,0,NUM_NOD*sizeof(float));
//computing internal force
internal_kernel<<<BLOCK_NUM,THREAD_NUM>>>(dNC,dFORCE);
cudaMemcpy(FORCE,dFORCE,NUM_NOD*sizeof(float),cudaMemcpyDeviceToHost);
for(int i=1;i<=NUM_NOD;i++)
printf("NOD %d %f\n",i,FORCE(i));
cudaFree(dNC);
cudaFree(dFORCE);
return 0;
}
__global__ void internal_kernel(int* NC, float *FORCE)
{
const int tid=threadIdx.x;
const int bid= blockIdx.x;
const int THREADS=blockDim.x;
const int BLOCKS=gridDim.x;
float FORCE_ELE[8];
//compute internal force
for(int NELE=bid*THREADS+tid+1;NELE<=NUM_ELE;NELE+=BLOCKS*THREADS)
{
for(int j=1;j<=8;j++)
FORCE_ELE[j-1]=NELE*10.0f+j*1.0f;
for(int j=1;j<=8;j++)
{
int NOD;
NOD=NC(j,NELE);
FORCE(NOD)=FORCE(NOD)+FORCE_ELE[j-1];
// printf("NELE: %1d J:%1d NOD:%2d \t FORCE_ELE:%10.1f\t FORCE:%10.1f\n",NELE,j,NOD,FORCE_ELE[j-1],FORCE(NOD));
}
}
return ;
}[/codebox]