Regarding a simple code slow on CUDA

Hi all,

I am trying to port my CPU code to CUDA. The code does a force calculation followed by integration. To start off, I thought I would move just the force calculation to CUDA first and the integration later. The force calculation has an external force part and an internal force part, so I concentrated on the external force only. This is the kernel for calculating the external force:

__global__ void
CalcExternalForceAndDisplacement(const int N, const float damping, const float3 gravity, float4* pVel, float4* pXi, float4* pX, float3* out_F, float3* out_U)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;

   float4 Xi  = pXi[idx];
   float4 X   = pX[idx];
   float4 vel = pVel[idx];

   out_F[idx] = make_float3(gravity.x - damping*vel.x,
                            gravity.y - damping*vel.y,
                            gravity.z - damping*vel.z);
   out_U[idx] = make_float3(X.x-Xi.x, X.y-Xi.y, X.z-Xi.z);
}


However, to my surprise, this code alone runs at almost half the speed of the CPU code, which calculates a lot more (Jacobians and matrix multiplications) in addition to the external force and displacements. The CPU code runs at ~720 fps while the above CUDA code runs at ~360 fps. Could anyone tell me why this is so?

My suspicion is that I am under-utilizing my hardware, i.e. not enough threads.

My execution configuration is as follows. N is the number of points and BLOCK_SIZE is set to 8 at the moment.

CalcExternalForceAndDisplacement<<<make_uint3(N,1,1),make_uint3(N/BLOCK_SIZE,1,1)>>>(N, damping, gravity, pVel, pXi, pX,  F, U);


I also added in a texture variant as follows but it still performs the same.


__global__ void
CalcExternalForceAndDisplacementTexture(const int N, const float damping, const float3 gravity, float3* out_F, float3* out_U)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;

   float4 Xi  = tex1Dfetch(texXi, idx);
   float4 X   = tex1Dfetch(texX,  idx);
   float4 vel = tex1Dfetch(texV,  idx);

   out_F[idx] = make_float3(gravity.x - damping*vel.x,
                            gravity.y - damping*vel.y,
                            gravity.z - damping*vel.z);
   out_U[idx] = make_float3(X.x-Xi.x, X.y-Xi.y, X.z-Xi.z);
}


void ComputeForcesCUDATexture(const int N, const float damping, const float3 gravity, float4* pVel, float4* pXi, float4* pX, float3* F, float3* U)
{
   cudaChannelFormatDesc channelDesc3 = cudaCreateChannelDesc<float4>();

   cudaBindTexture(0, texXi, pXi,  channelDesc3);
   cudaBindTexture(0, texX,  pX,   channelDesc3);
   cudaBindTexture(0, texV,  pVel, channelDesc3);

   CalcExternalForceAndDisplacementTexture<<<make_uint3(N,1,1),make_uint3(N/BLOCK_SIZE,1,1)>>>(N, damping, gravity, F, U);

   cudaUnbindTexture(texXi);
   cudaUnbindTexture(texX);
   cudaUnbindTexture(texV);
}


Can anyone shed some light on this, please?



OK, just to get more insight, I ran the visual profiler and it tells me that my instruction throughput is 0.0454607,
so I think I do not have enough threads. Is that so?

Your block size is way too small.
Try something around 192-256.
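
To make the suggestion above concrete, here is a minimal sketch of a launch with a larger block. In `<<<grid, block>>>` the first argument is the number of blocks and the second is the number of threads per block, so the grid size is usually the element count divided by the block size, rounded up. The names `blocksFor`, `CalcExternalForceAndDisplacementGuarded` and `LaunchGuarded` are made up for this illustration, and the block size of 256 is just one value from the suggested range. The `idx >= N` guard is an addition, needed because the rounded-up grid can contain more threads than points. The CUDA part is wrapped in `__CUDACC__` so that the helper also compiles with a plain C++ compiler.

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n (ceiling division).
inline unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

#ifdef __CUDACC__
#include <cuda_runtime.h>

// Hypothetical variant of the kernel from the question, with a bounds check added.
__global__ void CalcExternalForceAndDisplacementGuarded(
    int N, float damping, float3 gravity,
    const float4* pVel, const float4* pXi, const float4* pX,
    float3* out_F, float3* out_U)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;  // the last block may be only partly used

    float4 Xi  = pXi[idx];
    float4 X   = pX[idx];
    float4 vel = pVel[idx];

    out_F[idx] = make_float3(gravity.x - damping * vel.x,
                             gravity.y - damping * vel.y,
                             gravity.z - damping * vel.z);
    out_U[idx] = make_float3(X.x - Xi.x, X.y - Xi.y, X.z - Xi.z);
}

// Hypothetical launcher: grid size first, block size second.
void LaunchGuarded(int N, float damping, float3 gravity,
                   const float4* pVel, const float4* pXi, const float4* pX,
                   float3* F, float3* U)
{
    const unsigned int threads = 256;  // assumed value, from the 192-256 range
    const unsigned int blocks  = blocksFor(static_cast<unsigned int>(N), threads);
    CalcExternalForceAndDisplacementGuarded<<<blocks, threads>>>(
        N, damping, gravity, pVel, pXi, pX, F, U);
}
#endif
```

With N = 250 this works out to a single block of 256 threads, which is still very little work for a GPU.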

Hi, thanks for the reply. Currently my N is 250.

Am I specifying my execution config correctly?


After looking at a couple of tutorials I have changed it to this,



But this gives me ~300 fps, which is even slower. Does this mean that I do not have enough threads for the current data size?

EDIT: OK, after another experiment, the GPU definitely does not have enough data to show its power. I doubled the data size and the CPU version came to a complete halt, whereas the GPU version still runs at ~200 fps, so it does perform better, especially when the data size is large.

I noticed from the visual debugger that around 60% of the kernel time is spent in memcpy, so I searched online for faster ways to transfer data to the GPU and came across PBOs. I wrote another function that uses PBOs instead, as follows:

void CUDA_PBO() {
   float4* devX;
   float4* devV;
   float3* devF;
   float3* devU;

   cudaGraphicsMapResources(4, &resID[0], NULL);

   cudaGraphicsResourceGetMappedPointer((void**)&devX, &size4, resID[0]);
   cudaGraphicsResourceGetMappedPointer((void**)&devV, &size4, resID[1]);
   cudaGraphicsResourceGetMappedPointer((void**)&devF, &size3, resID[2]);
   cudaGraphicsResourceGetMappedPointer((void**)&devU, &size3, resID[3]);

   ComputeForcesCUDATexture(total_points, damping, g, devV, dXi, devX, devF, devU);

   cudaGraphicsUnmapResources(4, &resID[0], NULL);

   //copy results to the host
   cudaMemcpy(&F[0].x, &devF[0].x, size3, cudaMemcpyDeviceToHost);
   cudaMemcpy(&U[0].x, &devU[0].x, size3, cudaMemcpyDeviceToHost);
}


My code now has three modes, CPU, GPU and GPU_PBO, and I can toggle between them using the spacebar key. On the first run, the CUDA code (given in the last reply) runs at the same performance as before (~300 fps), followed by the PBO code, which is even slower (~150 fps). However, the strange thing is that as soon as I toggle back to the GPU mode after using the GPU_PBO mode, the frame rate for the old CUDA code becomes 650 fps (almost 2x). Could anyone suggest why this is happening?

I’m just starting out with CUDA programming myself.
My guess is that the just-in-time compiler compiles the kernel the first time it is called, so subsequent calls do not need the compilation step. Maybe.
(I sent you a private message, make sure to check it.)

If you are measuring such a short time, make sure you measure only the specific action, or use lots of iterations. The card needs some initial “booting”, so make sure you do not measure that as well.
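
A rough sketch of that measuring advice, assuming a hypothetical helper called `averageMillis`: it runs the work a few times untimed so that one-off start-up costs are excluded, then times many iterations and reports the average. It is plain host-side C++ and does not know anything about the GPU, so the callable passed in has to wait for the device itself.

```cpp
#include <chrono>

// Hypothetical helper: average wall-clock time per call, in milliseconds.
// The warm-up calls are executed but not timed.
template <typename Fn>
double averageMillis(Fn&& work, int warmupRuns, int timedRuns)
{
    for (int i = 0; i < warmupRuns; ++i) {
        work();  // absorbs first-call costs such as context creation
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; ++i) {
        work();
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timedRuns;
}
```

Kernel launches are asynchronous, so when timing a kernel this way the callable should end with `cudaDeviceSynchronize()`; otherwise only the launch overhead is measured.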