Hi all,

I am trying to port my CPU code to CUDA. The code has a force calculation and an integration step. To start off, I thought I would move just the force calculation to CUDA first and the integration later. The force calculation consists of an external force calculation and an internal force calculation, so I concentrated on the external force only. This is the kernel for calculating the external force:

```
__global__ void
CalcExternalForceAndDisplacement(const int N, const float damping, const float3 gravity,
                                 float4* pVel, float4* pXi, float4* pX,
                                 float3* out_F, float3* out_U)
{
    // One thread per point.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)   // valid indices are 0..N-1
        return;

    float4 Xi  = pXi[idx];
    float4 X   = pX[idx];
    float4 vel = pVel[idx];

    // External force: gravity minus velocity damping.
    out_F[idx] = make_float3(gravity.x - damping * vel.x,
                             gravity.y - damping * vel.y,
                             gravity.z - damping * vel.z);

    // Displacement: X - Xi.
    out_U[idx] = make_float3(X.x - Xi.x, X.y - Xi.y, X.z - Xi.z);
}
```

However, to my surprise, this kernel alone runs at almost half the speed of the CPU code, which calculates a lot more (Jacobians and matrix multiplications) in addition to the external force and displacements. The CPU code runs at ~720 fps, while the above CUDA code runs at ~360 fps. Could anyone tell me why this is so?

My suspicion is that I am under-utilizing my hardware, i.e. not launching enough threads.

My execution configuration is as follows. N is the number of points, and BLOCK_SIZE is set to 8 at the moment:

```
CalcExternalForceAndDisplacement<<<make_uint3(N,1,1),make_uint3(N/BLOCK_SIZE,1,1)>>>(N, damping, gravity, pVel, pXi, pX, F, U);
```

I also added in a texture variant as follows but it still performs the same.

```
// texXi, texX and texV are file-scope float4 texture references (declarations not shown).
__global__ void
CalcExternalForceAndDisplacementTexture(const int N, const float damping, const float3 gravity,
                                        float3* out_F, float3* out_U)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)   // valid indices are 0..N-1
        return;

    // Same reads as the first kernel, but through the texture path.
    float4 Xi  = tex1Dfetch(texXi, idx);
    float4 X   = tex1Dfetch(texX, idx);
    float4 vel = tex1Dfetch(texV, idx);

    // External force: gravity minus velocity damping.
    out_F[idx] = make_float3(gravity.x - damping * vel.x,
                             gravity.y - damping * vel.y,
                             gravity.z - damping * vel.z);

    // Displacement: X - Xi.
    out_U[idx] = make_float3(X.x - Xi.x, X.y - Xi.y, X.z - Xi.z);
}

void ComputeForcesCUDATexture(const int N, const float damping, const float3 gravity,
                              float4* pVel, float4* pXi, float4* pX,
                              float3* F, float3* U)
{
    // Bind the device arrays to the texture references for this launch.
    cudaChannelFormatDesc channelDesc3 = cudaCreateChannelDesc<float4>();
    cudaBindTexture(0, texXi, pXi, channelDesc3);
    cudaBindTexture(0, texX, pX, channelDesc3);
    cudaBindTexture(0, texV, pVel, channelDesc3);

    CalcExternalForceAndDisplacementTexture<<<make_uint3(N,1,1),make_uint3(N/BLOCK_SIZE,1,1)>>>(N, damping, gravity, F, U);

    cudaUnbindTexture(texXi);
    cudaUnbindTexture(texX);
    cudaUnbindTexture(texV);
}
```
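
One thing I still need to rule out: in the `<<<grid, block>>>` syntax the first argument is the number of blocks and the second is the number of threads per block, so my launch may be passing them the opposite way round from what I intended. Below is a rough host-side sketch of the arithmetic, written as plain C++ so it can also sit inside a .cu file. The names `LaunchConfig`, `makeLaunchConfig`, `currentLaunchConfig` and `totalThreads`, as well as the block size of 256, are my own assumptions for illustration and are not part of the code above.

```cpp
// Hypothetical helper type (not in my code): a 1D launch described as
// (number of blocks, threads per block).
struct LaunchConfig {
    unsigned int blocks;
    unsigned int threadsPerBlock;
};

// Conventional arrangement: fix the block size, then round the block
// count up so that blocks * threadsPerBlock >= n.
inline LaunchConfig makeLaunchConfig(unsigned int n, unsigned int threadsPerBlock)
{
    LaunchConfig cfg;
    cfg.threadsPerBlock = threadsPerBlock;
    cfg.blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    return cfg;
}

// The arrangement in my current launch, <<<N, N / BLOCK_SIZE>>>:
// N blocks of N / BLOCK_SIZE threads each.
inline LaunchConfig currentLaunchConfig(unsigned int n, unsigned int blockSize)
{
    LaunchConfig cfg;
    cfg.blocks = n;
    cfg.threadsPerBlock = n / blockSize;
    return cfg;
}

// Total number of threads a configuration launches.
inline unsigned long long totalThreads(const LaunchConfig& cfg)
{
    return static_cast<unsigned long long>(cfg.blocks) * cfg.threadsPerBlock;
}

// In the .cu file the launch would then read (256 is an assumed block size):
//   LaunchConfig cfg = makeLaunchConfig(N, 256);
//   CalcExternalForceAndDisplacement<<<cfg.blocks, cfg.threadsPerBlock>>>(
//       N, damping, gravity, pVel, pXi, pX, F, U);
```

For a hypothetical N = 1000 with BLOCK_SIZE = 8, `currentLaunchConfig` gives 1000 blocks of 125 threads, i.e. 125,000 threads for 1000 points, nearly all of which exit at the bounds check. `makeLaunchConfig(1000, 256)` gives 4 blocks of 256 threads, i.e. 1024 threads. The threads-per-block value N / BLOCK_SIZE also grows with N, so for larger N it would exceed the per-block thread limit of the device (1024 on current hardware, lower on older devices) and the launch would fail.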

Can anyone shed some light on this, please?

Thanks,

Mobeen