Help on fixing some poor performances (rookie)

Hi, this is my first post, being written a couple of hours after i started playing with CUDA, so my question will no doubt be trivial, sorry in advance!

I was quite amazed at the particule simulator’s performance and having written my own in plain old C, i started wondering if i could translate it to use CUDA.

To get myself familiar with the tools, i started writing a simple program that would take 2 vectors and do some math between them.

I write both vector once to device memory then loop the kernel 10000 times and compare this execution on both gpu and cpu to find out that the execution time is pretty much the same.

Here is the code:

#define THREAD_CNT 256

#define NUM_PARTICULES 104200

const int BLOCK_CNT = ceil((float)NUM_PARTICULES/(float)THREAD_CNT);

float4 partPos[NUM_PARTICULES];

float4 partForces[NUM_PARTICULES];

#define DT 0.001
__global__ void integrate( float4* g_data, float4* g_force,float dt )

{

	int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x;

	g_data[index].z+=g_force[index].z;

}
int main(int argc, char** argv)

{

	

	initParticules();

	cudaInit();

	

	unsigned int timer0;

	CUT_SAFE_CALL( cutCreateTimer( &timer0));

	CUT_SAFE_CALL( cutStartTimer( timer0));

	

	for(int i=0;i<10000;i++)

	{

  update();

  //onCpu();

  render();

	}

	CUT_SAFE_CALL( cutStopTimer( timer0));

	printf("gpu time: %f\n", cutGetTimerValue(timer0));

	

	unsigned int timer1;

	CUT_SAFE_CALL( cutCreateTimer( &timer1));

	CUT_SAFE_CALL( cutStartTimer( timer1));

	for(int i=0;i<10000;i++)

	{

  //update();

  onCpu();

  render();

	}

	

	CUT_SAFE_CALL( cutStopTimer( timer1));

	printf("cpu time: %f\n", cutGetTimerValue(timer1));

	return 1;

}
void onCpu()

{

	for(int i=0;i<NUM_PARTICULES;i++)

	{

  partPos[i].z+=partForces[i].z;

	}

}
void cudaInit()

{

  CUT_DEVICE_INIT();

  int mem_size = NUM_PARTICULES*sizeof(float4);

  CUDA_SAFE_CALL(cudaMalloc((void**) &gpuPos, mem_size));

  CUDA_SAFE_CALL(cudaMalloc((void**) &gpuForce, mem_size));

  CUDA_SAFE_CALL(cudaMemcpy(gpuForce, partForces, mem_size, cudaMemcpyHostToDevice) );

	

}
void update()

{

	//CUDA_SAFE_CALL(cudaMemcpy(gpuPos, partPos, mem_size, cudaMemcpyHostToDevice) );         

    integrate<<< BLOCK_CNT, THREAD_CNT >>>((float4*) gpuPos,(float4*) gpuForce,DT);

    //CUDA_SAFE_CALL(cudaMemcpy(partPos, gpuPos, mem_size,cudaMemcpyDeviceToHost));                       

}

Which result in:

gpu time: 2151.280029

cpu time: 2044.373657

On a core2duo e6550 and geforce 8800gt. I was expecting a much lower gpu time and im guessing its due to some rookie mistake on my part. Can anyone tell me what it is?

Sorry about the messy code, im only just learning, not writing anything useful at the moment.

Thanks!

Sami

–EDIT

After some more testing, i see the gpu performances starting to ramp up when i quadruple the number of particules

gpu time: 8268.873047

cpu time: 32456.363281

Am i not using the gpu entirely with an array of only 104200? I would seem unlikely to me, but hey… rookie!

From your numbers, 4x the particles leads to 4x the GPU time but 16x the CPU time. That seems strange for the CPU. Is there some O(N^2) operation going on that I’m not seeing?

You are doing everything correct to get optimal performance, though you could possibly get a few more % by tweaking THREAD_CNT. Amazingly enough, a mere 1042000 additions is a “small” problem for the GPU. I’ve been doing particle sims since March on the GPU, and I don’t think my integrate step is any faster than yours at the moment :) Where I really get screaming performance is in the calculation of the forces where I need to add up forces between each particle and ~100 of his neighbors. Even that operation requires at least 10,000 particles on the GPU in order to start achieving peak performance.

Calculate how man GB/s of data is transferred, that is going to be your bottleneck: two floats read and one float write per thread to see how optimally you are using the GPU (70GB/s is an achievable peak). Note also that kernel launches are asynchronous, so to get correct timing, you really need to have a cudaThreadSychronize() call before getting the final timer value.

Also, you will find that as you add more and more calculations to your integrate function (i.e. simple calculations of varying forces), the kernel will not take any longer to run! (at least as long as you don’t add more memory operations in the process). This is because the GPU has a massive number of GFLOPS available to do while waiting for memory reads to take place.

Strange indeed, i had not noticed that. I posted all the code, render() is empty and initParticules() is only ran before both benchmarks. So, as far as i can see, no O(n^2) in there.

And since i am getting 4x GPU times as you have pointed out, i am not utilizing the gpu any more than i was before…

Also, as a follow up question, if i uncomment both memcpy lines, i get (abysmally) slow results for the gpu. Is this normal?

Looking at the samples, it appears that a gl buffers are used to store data that will ultimetely be used as rending information, so am i right in saying that if i do not intend to use the data on the cpu, i should use gl buffers and avoid memcpy?

You are not actually doing much work on the GPU as far as I can tell… you are doing several memory accesses though. Is there more work that you can move onto the GPU. Looks like you are updating positions. Could you also compute the forces?

4x -> 16x does seem a little strange. It could be that the data no longer fits in the L1 (or maybe L2) cache for those 10,000 iterations but I don’t expect that to be so drastic an effect for simple linear mem access. You are compiling with optimizations enabled, right?

About the memcpys: They are done over PCI-Express which maxes out at 2-3 GB/s using cudaMallocHost memory. They are going to be extremely slow no matter what you do. The key to achieving the best possible performance is to copy data back and forth as little as possible. If your intention is to render on the GPU in realtime, the GL interop is going to be the best route, though I have no experience with it.

Indeed i am not doing anything significant for now. This was just to test how much quicker a SIMD like operation on a vector could be done on the gpu. My initial guess was that accesses to the shared memory was slowing me down and i guess that this is what youre telling me.

Thanks for the quick help everybody!

If I remember correctly… and there are those much smarter than me on there that will correct me if this is wrong… but listening to the UIUC spring 2007 lectures I seem to recall them saying that a rough rule of thumb was roughly 4 arithmetic operations per memory operations was needed to have “decent” performance. Again, this may be way wrong.

Personally, I have had a few really amazing GPU results but also a lot of duds with the duds almost always memory related. At least you are implementing something that you can be fairly certain can be written to work well on the GPU.

Best of luck… this is fun stuff!

Fun indeed. And i can confirm that. I simply added more arithmetic work, like this:

int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x;

	__shared__ float4 shd[THREAD_CNT];

	shd[threadIdx.x] = g_data[index];

	__syncthreads();

	

	if(threadIdx.x<120)

	{

  shd[threadIdx.x].z+=shd[threadIdx.x+1].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+2].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+3].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+1].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+2].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+3].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+1].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+2].z;

  shd[threadIdx.x].z+=shd[threadIdx.x+3].z;

	}

	__syncthreads();

And i now have an appreciable gain when running on the gpu. Again, this does absolutely nothing useful. Im just getting my head around the architecture.

BTW: In case you are interested… here is the link to the course (mp3s and ppts) I mentioned.

http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html

Thanks for the link wildcat4096!