GPUs vs Intel Nehalem / AMD Barcelona vs IBM Cell processor: SP and DP comparison

Hello all…

I finally got time to arrange and share the paper I/my friend presented at the International Conference on Parallel Processing 2009 held in Vienna.

The paper is titled:

Direct N-body Kernels for Multicore Platforms

and compares the performance of multi-core Nehalem and Barcelona systems with the IBM Cell Broadband Engine (2 Cell processors) and the Tesla C1060 and C870, for both single- and double-precision direct N-body kernels. The results for DP are certainly interesting and go in favour of GPUs :) . It has a few interesting graphs, I guess… We tried to provide as much insight/information as we could within the 10-page limit (IEEE). :confused:

This was my first actual research project in CUDA (summer of last year) . I hope it helps some of you out there.

I will be putting my other relevant papers (other GPU-related research) on my web page (which is in bad condition right now :( ) … once I convince my advisors to make them publicly available without violating the publishing license :teehee:


I am open to any views … good or bad… :)

NOTE: This is the pre-conference version… and may have some typos (e.g. Table 1, power consumption for Nehalem and Barcelona)

Kudos for diving into the painful world of Cell programming.

I think all the guys at NV had big big smiles when they saw your figure #12.

Another interesting reference is an n-body implementation on ATI GPUs by Fujiwara and Nakasato.

Yeah… Cell is Hell :-)

YEA :verymad: … BIG TIME HELL !!

Thanks for the ref… interesting… it has nice hardware features for this kind of computation… and enables double unrolling and all …

I don’t have any knowledge of ATI hardware. How different is it from NVIDIA (with respect to the programming model)?

If only CUDA was hardware independent… :teehee:

Love it…

The programming model is quite similar. There are a few noticeable differences:

  • With ATI hardware you have to code for SIMD explicitly. They use 5-wide vector units (well, more like 4+1). CUDA abstracts this away: kernels are written as if the hardware had scalar units.

  • There are more levels of, so to say, compute capability among cards. And the differences are quite pronounced; for example, only the most recent DX11 cards got truly random-access read/write shared memory.

  • Coalescing rules may be a bit more restrictive, and you get the most performance from accessing float4s.
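To make the first and third points a bit more concrete, here is a rough sketch in plain C (not ATI's or NVIDIA's actual API). The `Vec4` struct and both function names are made up for illustration; the loop index stands in for the work-item ID. The scalar version is the shape you write in CUDA, the 4-wide version is the shape that explicit SIMD and float4 accesses push you towards.

```c
#include <stddef.h>

/* Hypothetical stand-in for a float4: four floats handled together. */
typedef struct { float x, y, z, w; } Vec4;

/* Scalar style: each "work-item" i touches exactly one float. */
void scale_add_scalar(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Explicit 4-wide style: each "work-item" i loads one Vec4 and
 * operates on all four lanes, so there are n/4 items for n floats. */
void scale_add_vec4(Vec4 *y, const Vec4 *x, float a, size_t n4)
{
    for (size_t i = 0; i < n4; ++i) {
        y[i].x += a * x[i].x;
        y[i].y += a * x[i].y;
        y[i].z += a * x[i].z;
        y[i].w += a * x[i].w;
    }
}
```

Both compute the same thing; the difference is only in how much work one item does and how wide each memory access is.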

The overall model is similar in that you write data-parallel kernels in a C-like language. Here’s their NBody kernel in OpenCL:


    /*
     * For a description of the algorithm and the terms used, please see the
     * documentation for this sample.
     *
     * Each work-item invocation of this kernel calculates the position for
     * one particle.
     *
     * Work-items use local memory to reduce memory bandwidth and reuse data.
     */
    __kernel void nbody_sim(
        __global float4* pos,
        __global float4* vel,
        int numBodies,
        float deltaTime,
        float epsSqr,
        __local float4* localPos)
    {
        unsigned int tid = get_local_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);

        // Number of tiles we need to iterate
        // (assumes numBodies is a multiple of localSize)
        unsigned int numTiles = numBodies / localSize;

        // position of this work-item
        float4 myPos = pos[gid];
        float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

        for(int i = 0; i < numTiles; ++i)
        {
            // load one tile into local memory
            int idx = i * localSize + tid;
            localPos[tid] = pos[idx];

            // Synchronize to make sure data is available for processing
            barrier(CLK_LOCAL_MEM_FENCE);

            // calculate acceleration effect due to each body
            // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
            for(int j = 0; j < localSize; ++j)
            {
                // Calculate acceleration caused by particle j on particle i
                float4 r = localPos[j] - myPos;
                float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
                float invDist = 1.0f / sqrt(distSqr + epsSqr);
                float invDistCube = invDist * invDist * invDist;
                float s = localPos[j].w * invDistCube;

                // accumulate effect of all particles
                acc += s * r;
            }

            // Synchronize so that next tile can be loaded
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        float4 oldVel = vel[gid];

        // updated position and velocity
        float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
        newPos.w = myPos.w;
        float4 newVel = oldVel + acc * deltaTime;

        // write to global memory
        pos[gid] = newPos;
        vel[gid] = newVel;
    }


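Written out, the per-particle update that the kernel above performs is roughly the following. This is read off the code rather than taken from the sample's documentation: the gravitational constant is taken as 1, $m_j$ is the `w` component of `pos`, $\varepsilon^2$ is `epsSqr`, and $\Delta t$ is `deltaTime`. The sum includes $j = i$, which contributes nothing because $\mathbf{r}_{ii} = 0$ and the softening keeps the denominator non-zero.

$$
\mathbf{a}_i = \sum_{j} \frac{m_j\,\mathbf{r}_{ij}}{\left(\lVert \mathbf{r}_{ij} \rVert^2 + \varepsilon^2\right)^{3/2}},
\qquad \mathbf{r}_{ij} = \mathbf{x}_j - \mathbf{x}_i
$$

$$
\mathbf{x}_i' = \mathbf{x}_i + \mathbf{v}_i\,\Delta t + \tfrac{1}{2}\,\mathbf{a}_i\,\Delta t^2,
\qquad \mathbf{v}_i' = \mathbf{v}_i + \mathbf{a}_i\,\Delta t
$$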

It’s not optimized, it’s a “didactic” example. I’ve no idea why they put so much whitespace in.

This is regarding programming in OpenCL. There’s also CAL (which is more like programming in PTX) and Brook++ (which I don’t know much about).

In ‘Table 1’

The power consumption figures for Nehalem and Barcelona are both off by a factor of four. (Two Xeons do not consume 760W)

You are right… shooot… I uploaded the pre-conf version of the paper… sorry for that… I have mentioned the typo in the first post now…

We got this corrected on the day of the conf :)

Thanks for pointing it out :)