Help on fixing some poor performances (rookie)

Ailleur · November 27, 2007, 7:32pm

Hi, this is my first post, written a couple of hours after I started playing with CUDA, so my question will no doubt be trivial. Sorry in advance!

I was quite amazed at the particle simulator’s performance and, having written my own in plain old C, I started wondering if I could translate it to use CUDA.

To familiarize myself with the tools, I started writing a simple program that takes two vectors and does some math between them.

I write both vectors to device memory once, then launch the kernel 10,000 times, and compare this against the same loop on the CPU, only to find that the execution times are pretty much the same.

Here is the code:

#define THREAD_CNT 256
#define NUM_PARTICULES 104200
// Round up so the last, partially filled block is still launched.
const int BLOCK_CNT = ceil((float)NUM_PARTICULES/(float)THREAD_CNT);
float4 partPos[NUM_PARTICULES];
float4 partForces[NUM_PARTICULES];
#define DT 0.001

__global__ void integrate( float4* g_data, float4* g_force, float dt )
{
	int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x;
	// BLOCK_CNT*THREAD_CNT (104448) exceeds NUM_PARTICULES, so skip the extra threads.
	if (index < NUM_PARTICULES)
		g_data[index].z+=g_force[index].z;
}

int main(int argc, char** argv)
{
	initParticules();
	cudaInit();

	// Time 10000 iterations on the GPU.
	unsigned int timer0;
	CUT_SAFE_CALL( cutCreateTimer( &timer0));
	CUT_SAFE_CALL( cutStartTimer( timer0));
	for(int i=0;i<10000;i++)
	{
		update();
		//onCpu();
		render();
	}
	CUT_SAFE_CALL( cutStopTimer( timer0));
	printf("gpu time: %f\n", cutGetTimerValue(timer0));

	// Time the same 10000 iterations on the CPU.
	unsigned int timer1;
	CUT_SAFE_CALL( cutCreateTimer( &timer1));
	CUT_SAFE_CALL( cutStartTimer( timer1));
	for(int i=0;i<10000;i++)
	{
		//update();
		onCpu();
		render();
	}
	CUT_SAFE_CALL( cutStopTimer( timer1));
	printf("cpu time: %f\n", cutGetTimerValue(timer1));
	return 0;
}

void onCpu()
{
	for(int i=0;i<NUM_PARTICULES;i++)
	{
		partPos[i].z+=partForces[i].z;
	}
}

void cudaInit()
{
	CUT_DEVICE_INIT();
	int mem_size = NUM_PARTICULES*sizeof(float4);
	CUDA_SAFE_CALL(cudaMalloc((void**) &gpuPos, mem_size));
	CUDA_SAFE_CALL(cudaMalloc((void**) &gpuForce, mem_size));
	CUDA_SAFE_CALL(cudaMemcpy(gpuForce, partForces, mem_size, cudaMemcpyHostToDevice) );
}

void update()
{
	//CUDA_SAFE_CALL(cudaMemcpy(gpuPos, partPos, mem_size, cudaMemcpyHostToDevice) );
	integrate<<< BLOCK_CNT, THREAD_CNT >>>((float4*) gpuPos,(float4*) gpuForce,DT);
	//CUDA_SAFE_CALL(cudaMemcpy(partPos, gpuPos, mem_size,cudaMemcpyDeviceToHost));
}

Which results in:

gpu time: 2151.280029

cpu time: 2044.373657

This is on a Core 2 Duo E6550 and a GeForce 8800 GT. I was expecting a much lower GPU time, and I’m guessing the difference is due to some rookie mistake on my part. Can anyone tell me what it is?

Sorry about the messy code. I’m only just learning, not writing anything useful at the moment.

Thanks!

Sami

–EDIT

After some more testing, I see the GPU performance starting to ramp up when I quadruple the number of particles:

gpu time: 8268.873047

cpu time: 32456.363281

Am I not fully using the GPU with an array of only 104200? It would seem unlikely to me, but hey… rookie!

seibert · November 27, 2007, 7:58pm

From your numbers, 4x the particles leads to 4x the GPU time but 16x the CPU time. That seems strange for the CPU. Is there some O(N^2) operation going on that I’m not seeing?

MisterAnderson42 · November 27, 2007, 8:02pm

You are doing everything correctly to get optimal performance, though you could possibly get a few more % by tweaking THREAD_CNT. Amazingly enough, a mere 104200 additions is a “small” problem for the GPU. I’ve been doing particle sims on the GPU since March, and I don’t think my integrate step is any faster than yours at the moment :) Where I really get screaming performance is in the calculation of the forces, where I need to add up forces between each particle and ~100 of its neighbors. Even that operation requires at least 10,000 particles on the GPU in order to start achieving peak performance.

Calculate how many GB/s of data are transferred to see how optimally you are using the GPU: that is two float reads and one float write per thread, and memory bandwidth is going to be your bottleneck (70 GB/s is an achievable peak). Note also that kernel launches are asynchronous, so to get correct timing you really need a cudaThreadSynchronize() call before reading the final timer value.

Also, you will find that as you add more and more calculations to your integrate function (e.g. simple calculations of varying forces), the kernel will not take any longer to run! (At least as long as you don’t add more memory operations in the process.) This is because the GPU has a massive number of GFLOPS available to use while waiting for memory reads to take place.

Ailleur · November 27, 2007, 8:04pm

Strange indeed, I had not noticed that. I posted all the code; render() is empty and initParticules() is only run once, before both benchmarks. So, as far as I can see, there is no O(n^2) in there.

And since I am getting 4x the GPU time, as you have pointed out, I am not utilizing the GPU any more than I was before…

Also, as a follow-up question: if I uncomment both memcpy lines, I get (abysmally) slow results for the GPU. Is this normal?

Looking at the samples, it appears that GL buffers are used to store data that will ultimately be used as rendering information. So am I right in saying that if I do not intend to use the data on the CPU, I should use GL buffers and avoid memcpy?

wildcat4096 · November 27, 2007, 8:08pm

You are not actually doing much work on the GPU as far as I can tell… you are doing several memory accesses though. Is there more work that you can move onto the GPU? It looks like you are updating positions. Could you also compute the forces?

MisterAnderson42 · November 27, 2007, 8:10pm

4x → 16x does seem a little strange. It could be that the data no longer fits in the L1 (or maybe L2) cache for those 10,000 iterations but I don’t expect that to be so drastic an effect for simple linear mem access. You are compiling with optimizations enabled, right?

About the memcpys: They are done over PCI-Express which maxes out at 2-3 GB/s using cudaMallocHost memory. They are going to be extremely slow no matter what you do. The key to achieving the best possible performance is to copy data back and forth as little as possible. If your intention is to render on the GPU in realtime, the GL interop is going to be the best route, though I have no experience with it.

Ailleur · November 27, 2007, 8:13pm

Indeed, I am not doing anything significant for now. This was just to test how much quicker a SIMD-like operation on a vector could be done on the GPU. My initial guess was that accesses to the shared memory were slowing me down, and I guess that this is what you’re telling me.

Thanks for the quick help everybody!

wildcat4096 · November 27, 2007, 9:40pm

If I remember correctly… and there are those much smarter than me on here who will correct me if this is wrong… but from listening to the UIUC spring 2007 lectures, I seem to recall them saying that a rough rule of thumb was that about 4 arithmetic operations per memory operation are needed to get “decent” performance. Again, this may be way wrong.

Personally, I have had a few really amazing GPU results but also a lot of duds with the duds almost always memory related. At least you are implementing something that you can be fairly certain can be written to work well on the GPU.

Best of luck… this is fun stuff!

Ailleur · November 27, 2007, 9:43pm

Fun indeed. And i can confirm that. I simply added more arithmetic work, like this:

	int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x;
	__shared__ float4 shd[THREAD_CNT];
	shd[threadIdx.x] = g_data[index];
	__syncthreads();

	if(threadIdx.x<120)
	{
		shd[threadIdx.x].z+=shd[threadIdx.x+1].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+2].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+3].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+1].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+2].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+3].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+1].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+2].z;
		shd[threadIdx.x].z+=shd[threadIdx.x+3].z;
	}
	__syncthreads();

And I now have an appreciable gain when running on the GPU. Again, this does absolutely nothing useful. I’m just getting my head around the architecture.

wildcat4096 · November 27, 2007, 10:14pm

BTW: In case you are interested… here is the link to the course (mp3s and ppts) I mentioned.

http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html

Neeraj_Kulkarni · November 28, 2007, 3:46pm

Thanks for the link wildcat4096!
