Simple code running slow on CUDA

Hi all,

I am trying to port my CPU code to CUDA. The code does force calculation and integration. To start off, I thought I would move just the force calculation to CUDA first and do the integration later. The force calculation consists of an external force calculation and an internal force calculation, so I concentrated on the external force only. This is the kernel for calculating the external force:

__global__ void
CalcExternalForceAndDisplacement(const int N, const float damping, const float3 gravity,
                                 float4* pVel, float4* pXi, float4* pX,
                                 float3* out_F, float3* out_U)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx >= N)   // >= rather than >, otherwise thread idx == N accesses past the end
      return;

   float4 Xi  = pXi[idx];
   float4 X   = pX[idx];
   float4 vel = pVel[idx];

   out_F[idx] = make_float3(gravity.x - damping*vel.x,
                            gravity.y - damping*vel.y,
                            gravity.z - damping*vel.z);
   out_U[idx] = make_float3(X.x - Xi.x, X.y - Xi.y, X.z - Xi.z);
}

However, to my surprise, just this code runs at almost half the speed of the CPU code, which calculates a lot of things (Jacobians and matrix multiplications) in addition to the external forces and displacements. The CPU code runs at ~720 fps while the CUDA code above runs at ~360 fps. Could anyone tell me why this is so?

My suspicion is that I am under-utilizing my hardware, i.e. not launching enough threads.

My execution configuration is as follows; N is the number of points, and BLOCK_SIZE is set to 8 at the moment:

CalcExternalForceAndDisplacement<<<make_uint3(N,1,1),make_uint3(N/BLOCK_SIZE,1,1)>>>(N, damping, gravity, pVel, pXi, pX,  F, U);


I also added a texture variant, as follows, but it performs the same:


// 1D texture references bound to the particle arrays in linear device memory
texture<float4, 1, cudaReadModeElementType> texXi;
texture<float4, 1, cudaReadModeElementType> texX;
texture<float4, 1, cudaReadModeElementType> texV;

__global__ void
CalcExternalForceAndDisplacementTexture(const int N, const float damping, const float3 gravity,
                                        float3* out_F, float3* out_U)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx >= N)   // >= so thread idx == N does not write past the end
      return;

   float4 Xi  = tex1Dfetch(texXi, idx);
   float4 X   = tex1Dfetch(texX,  idx);
   float4 vel = tex1Dfetch(texV,  idx);

   out_F[idx] = make_float3(gravity.x - damping*vel.x,
                            gravity.y - damping*vel.y,
                            gravity.z - damping*vel.z);
   out_U[idx] = make_float3(X.x - Xi.x, X.y - Xi.y, X.z - Xi.z);
}

void ComputeForcesCUDATexture(const int N, const float damping, const float3 gravity,
                              float4* pVel, float4* pXi, float4* pX, float3* F, float3* U)
{
   cudaChannelFormatDesc channelDesc3 = cudaCreateChannelDesc<float4>();
   cudaBindTexture(0, texXi, pXi,  channelDesc3);
   cudaBindTexture(0, texX,  pX,   channelDesc3);
   cudaBindTexture(0, texV,  pVel, channelDesc3);

   CalcExternalForceAndDisplacementTexture<<<make_uint3(N,1,1), make_uint3(N/BLOCK_SIZE,1,1)>>>(N, damping, gravity, F, U);

   cudaUnbindTexture(texXi);
   cudaUnbindTexture(texX);
   cudaUnbindTexture(texV);
}

Can anyone shed some light on this, please?

Thanks,

Mobeen

OK, just to get more insight, I ran the Visual Profiler and it tells me that my instruction throughput is 0.0454607,
so I think I do not have enough threads? Is that so?

Your block size is way too small.
Try something around 192-256.
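
For example, with a 256-thread block the launch would look something like this (just a sketch reusing your kernel's arguments; the grid count is rounded up so a final partial block covers the tail, which is also why the kernel needs its bounds check):

int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up
CalcExternalForceAndDisplacement<<<blocks, threadsPerBlock>>>(N, damping, gravity, pVel, pXi, pX, F, U);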

Hi, thanks for the reply. Currently my N is 250.

Am I specifying my execution configuration correctly?

<<<make_uint3(N,1,1), make_uint3(N/BLOCK_SIZE,1,1)>>>

After looking at a couple of tutorials, I have changed it to this:

int BLOCKS = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;   // rounded up so all N points are covered

CalcExternalForceAndDisplacement<<<make_uint3(BLOCKS,1,1), make_uint3(THREADS_PER_BLOCK,1,1)>>>(N, damping, gravity, pVel, pXi, pX, F, U);

But this is giving me ~300 fps, which is even slower. Does this mean that for the current data size I do not have enough threads?

EDIT: OK, after another experiment, the GPU definitely does not have enough data to show its power. I doubled the data size and the CPU version ground to a halt, whereas the GPU version still stands at ~200 fps, so it is indeed performing better, especially when the data size is large.

I noticed from the Visual Profiler that around 60% of the kernel time is memcpy, so I searched online for fast methods of GPU transfer and came across PBOs. I then wrote another function to use a PBO instead, as follows:

void CUDA_PBO()
{
   float4* devX;
   float4* devV;
   float3* devF;
   float3* devU;

   cudaGraphicsMapResources(4, &resID[0], NULL);
   cudaGraphicsResourceGetMappedPointer((void**)&devX, &size4, resID[0]);
   cudaGraphicsResourceGetMappedPointer((void**)&devV, &size4, resID[1]);
   cudaGraphicsResourceGetMappedPointer((void**)&devF, &size3, resID[2]);
   cudaGraphicsResourceGetMappedPointer((void**)&devU, &size3, resID[3]);

   ComputeForcesCUDATexture(total_points, damping, g, devV, dXi, devX, devF, devU);

   // copy results to the host while the resources are still mapped;
   // the mapped pointers are no longer valid after cudaGraphicsUnmapResources
   cudaMemcpy(&F[0].x, &devF[0].x, size3, cudaMemcpyDeviceToHost);
   cudaMemcpy(&U[0].x, &devU[0].x, size3, cudaMemcpyDeviceToHost);

   cudaGraphicsUnmapResources(4, &resID[0], NULL);
}
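
While reading about transfers I also saw pinned (page-locked) host memory recommended for plain device-to-host copies; as I understand it, the idea is something like this (a sketch, with hypothetical hostF/hostU buffers standing in for my F and U arrays):

// allocate pinned host buffers once at startup; DMA transfers from these
// are typically much faster than from ordinary pageable memory
float3 *hostF = NULL, *hostU = NULL;
cudaHostAlloc((void**)&hostF, size3, cudaHostAllocDefault);
cudaHostAlloc((void**)&hostU, size3, cudaHostAllocDefault);

// per frame, after the kernel has run:
cudaMemcpy(hostF, devF, size3, cudaMemcpyDeviceToHost);
cudaMemcpy(hostU, devU, size3, cudaMemcpyDeviceToHost);

// at shutdown:
cudaFreeHost(hostF);
cudaFreeHost(hostU);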

In my code I now have three modes, CPU, GPU, and GPU_PBO, and I can toggle between them with the spacebar. On the first run, the CUDA code (given in my last reply) runs at the same performance (~300 fps) as before, followed by the PBO code, which is even slower (~150 fps). However, the strange thing is that as soon as I toggle to the GPU mode after using the GPU_PBO mode, the frame rate becomes 650 fps for the old CUDA code (almost 2x). Could anyone suggest why this is happening?

I'm starting to do CUDA programming myself.
My guess is that the first time around the just-in-time compiler has to compile the kernel, and subsequent calls don't need the compilation step. Maybe.
(I sent you a private message, make sure to check it.)

If you are measuring such short times, make sure you measure only the specific action, or use a lot of iterations. The card also needs some initial "booting"; make sure you are not measuring that as well.
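
Something like this, for example (a sketch: one warm-up launch so initialization is not timed, then CUDA events around many iterations; the launch configuration is the hypothetical 256-thread one suggested above):

// warm-up launch: absorbs context creation / first-use overhead
CalcExternalForceAndDisplacement<<<(N + 255) / 256, 256>>>(N, damping, gravity, pVel, pXi, pX, F, U);
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 1000; ++i)   // average over many iterations
   CalcExternalForceAndDisplacement<<<(N + 255) / 256, 256>>>(N, damping, gravity, pVel, pXi, pX, F, U);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %.6f ms\n", ms / 1000.0f);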