Quick question about memory coalescing

I have the code shown below. I use memory padding to maximize the memory throughput. According to the Nvidia fast N-body simulation code, they use float4 and store the mass of each body in the w field of the body's float4 position, which also keeps the memory accesses coalesced. My question is: which way of coding is better? Also, we don't necessarily need to use float4 variables if we pad the memory instead, do we?
Nvidia’s code:

__device__ float3 bodyBodyInteraction(float4 bi, float4 bj, float3 ai)
{
float3 r;

// r_ij

r.x = bj.x - bi.x;
r.y = bj.y - bi.y;
r.z = bj.z - bi.z;

or page 6 of http://media.hpcwire.com/documents/nbody_g…ch31%5B1%5D.pdf
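
(For completeness, that function continues roughly as follows in the GPU Gems 3 chapter; I'm quoting from memory, so check the PDF above for the exact source. EPS2 is the softening constant defined in that code, and the mass is picked up from bj.w:)

// distSqr = dot(r_ij, r_ij) + EPS^2
float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + EPS2;

// invDistCube = 1/distSqr^(3/2)
float distSixth = distSqr * distSqr * distSqr;
float invDistCube = 1.0f / sqrtf(distSixth);

// s = m_j * invDistCube   (this is where the mass stored in .w gets used)
float s = bj.w * invDistCube;

// a_i = a_i + s * r_ij
ai.x += r.x * s;
ai.y += r.y * s;
ai.z += r.z * s;

return ai;
}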

My code:
// ==========================================================================

__host__ void ComputeGravityOnDevice( const float *pos, const float *mass,
float *accel, const float eps2,
const unsigned int n ) {
// Does the gravity on the GPU

// This routine takes the list of positions and masses, and computes the
// acceleration for each particle

// pos[3*n] holds the x, y and z co-ordinates of each particle in pos[(3*i)+0] ... pos[(3*i)+2]
// mass[n] holds the masses
// accel[3*n] is undefined on entry, and is to hold the accelerations, laid out in the same way as pos
// eps2 is the square of the softening length

// In order to maximise GPU thread usage, we want to make the total number of particles
// a multiple of chunkSize. We do this by padding the arrays with zero mass particles

float3 *rDev, *aDev;
float *mDev;

static float3 rPad[chunkSize];
static float mPad[chunkSize];

unsigned int nDev;

long nBlocks, nThreads;

static bool firstCall = true;
struct cudaDeviceProp devInfo;
clock_t start, stop;

// Get some information about the device
cudaGetDeviceProperties( &devInfo, 0 );
if( devInfo.maxThreadsPerBlock < chunkSize ) {
printf("chunkSize larger than maxThreadsPerBlock\n");
abort();
}

// Want to zero out the arrays we use for padding
// These are static so that we don’t have to do this every time
if( firstCall ) {
for( unsigned int i=0; i < chunkSize; i++ ) {
rPad[i] = make_float3(0,0,0);
mPad[i] = 0;
}
}

// Compute nDev, the number of particles which will be on the device
if( (n % chunkSize) == 0 ) {
// Divides exactly
nDev = n;
}
else {
// Doesn’t divide
nDev = (1+(n/chunkSize)) * chunkSize;
}

// Allocate arrays on device
cudaMalloc( (void**)&rDev, nDev*sizeof(float3) );
cudaMalloc( (void**)&aDev, nDev*sizeof(float3) );
cudaMalloc( (void**)&mDev, nDev*sizeof(float) );

// Copy positions and masses to device
start = clock();

// First the real data
cudaMemcpy( rDev, pos, 3*n*sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( mDev, mass, n*sizeof(float), cudaMemcpyHostToDevice );

// Now, the padding data
cudaMemcpy( (void**)&rDev[n], rPad, 3*(nDev-n)*sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( (void**)&mDev[n], mPad, (nDev-n)*sizeof(float), cudaMemcpyHostToDevice );

stop = clock();
GPUMemSend += (stop-start);

// Compute the number of blocks and threads
// nDev will be an integer multiple of chunkSize
nThreads = chunkSize;
nBlocks = nDev / chunkSize;

// Let the user know something about what’s going on
if( firstCall ) {
printf("\nGPU Calculation Information:\n");
printf("nBlocks : %ld\n", nBlocks);
printf("nThreads: %ld\n", nThreads);
printf("nDev    : %u\n", nDev);
printf("Padding : %u\n", nDev-n);
printf("\n\n");
}

start = clock();
// Launch on device
// Remember the device sees nDev particles
DoGravityDevice<<<nBlocks,nThreads>>>( rDev, mDev, aDev, eps2, nDev );

// Calls to the GPU are asynchronous, so if we want our timings to be correct
// we have to wait at this point
cudaThreadSynchronize();
stop = clock();
GPUticks += (stop-start);

// Copy back accel array from device
// We only need to copy the n particles, not all the way to nDev
start = clock();
cudaMemcpy( accel, aDev, 3*n*sizeof(float), cudaMemcpyDeviceToHost );
stop = clock();
GPUMemRec += (stop-start);

// Free memory on device
cudaFree( rDev );
cudaFree( aDev );
cudaFree( mDev );

// Indicate that we’ve done at least one call
firstCall = false;

}

// ==========================================================================

Thank you.

I think you are not padding right.

You need to pad 'interleaved'. The memory is laid out like this:
x[0] y[0] z[0] w[0] x[1] y[1] z[1] w[1] …
If I read your code, you think it is laid out like this:

x[0] x[1] … x[N] y[0] y[1] … y[N] z[0] z[1] … z[N] w[0] w[1] … w[N]

So I would definitely do it the same way NVIDIA has done it; the SDK samples can usually be taken as best practice.
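
To spell out the two layouts, roughly, in C terms (illustrative declarations only, not code from either post):

// Interleaved / array-of-structures: what an array of float3 or float4 looks like
// in memory, and what the NVIDIA sample uses:
//    x[0] y[0] z[0] w[0] x[1] y[1] z[1] w[1] ...
float4 *bodies;          // bodies[i].x, bodies[i].y, bodies[i].z, bodies[i].w

// Planar / structure-of-arrays: each component in its own contiguous block:
//    x[0] x[1] ... x[N] | y[0] y[1] ... y[N] | ...
struct Bodies {
    float *x;
    float *y;
    float *z;
    float *w;
};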

Do you think it should be something like this:

// First the real data

cudaMemcpy( rDev, pos, 3*n*sizeof(float), cudaMemcpyHostToDevice );

cudaMemcpy( mDev, mass, n*sizeof(float), cudaMemcpyHostToDevice );

// Now, the padding data

cudaMemcpy( (void**)&rDev[n], rPad, (nDev-n)*sizeof(float3), cudaMemcpyHostToDevice );

cudaMemcpy( (void**)&mDev[n], mPad, (nDev-n)*sizeof(float), cudaMemcpyHostToDevice );

No, I think you have to do something like this:

float4 *pos_cuda = (float4*)malloc(nDev * sizeof(float4));

pos_cuda[n_cuda].x = pos[n];
pos_cuda[n_cuda].y = pos[n+1];
pos_cuda[n_cuda].z = pos[n+2];
pos_cuda[n_cuda].w = 0.0f;

because you will never be able to read pos coalesced: each thread would need 3 consecutive floats. With the float4 array you can read one float4 per thread, fully coalesced.
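
For what it's worth, a complete packing loop in that spirit might look like the sketch below. The names (pos_cuda, posDev) are just illustrative, and I've put the mass into .w as the SDK sample does rather than the 0.0f in the snippet above; the zero-mass padding particles go at the tail so nDev stays a multiple of chunkSize.

// Sketch: repack the interleaved x,y,z array into a padded float4 array.
float4 *pos_cuda = (float4*)malloc(nDev * sizeof(float4));

for( unsigned int i = 0; i < nDev; i++ ) {
    if( i < n ) {
        pos_cuda[i].x = pos[3*i+0];
        pos_cuda[i].y = pos[3*i+1];
        pos_cuda[i].z = pos[3*i+2];
        pos_cuda[i].w = mass[i];     // stash the mass in w, as the SDK sample does
    }
    else {
        pos_cuda[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);   // zero-mass padding
    }
}

// posDev is assumed to be a device allocation of nDev float4s
cudaMemcpy( posDev, pos_cuda, nDev*sizeof(float4), cudaMemcpyHostToDevice );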

Is it possible not to use float4?

Using a float4 is much simpler and cleaner, and then you also have the option of reading from a texture if portions of your algorithm cannot read coalesced. I use float4’s in my application for storing x,y,z data, although I do stuff a particle type identifier in w so it isn’t completely wasted.

If you really want to read straight float3's coalesced, it can be done. You need to allocate a staging area in shared memory. Then have the appropriate number of threads in the block (boundary conditions get very tricky here...) read in floats, one thread per float. Then __syncthreads(), and each thread can pull its own float3 out of shared memory instead of reading a float4 straight from global memory.
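
Roughly, the staging pattern looks like the sketch below (made-up kernel name, a fixed block size of 256 just for illustration, and only the simple boundary handling; it's longer than the bare minimum because of the comments and bounds checks):

#include <cuda_runtime.h>

#define TILE 256   // assumed block size, for illustration only

__global__ void LoadFloat3Staged( const float *posRaw, float3 *posOut, unsigned int n )
{
    // Staging area: 3 floats for every thread in the block
    __shared__ float stage[3 * TILE];

    unsigned int tid    = threadIdx.x;
    unsigned int block0 = 3 * blockIdx.x * TILE;   // first raw float owned by this block

    // Coalesced phase: thread tid reads floats block0+tid, block0+TILE+tid and
    // block0+2*TILE+tid, so neighbouring threads always touch neighbouring addresses
    for( unsigned int k = 0; k < 3; k++ ) {
        unsigned int idx = block0 + k*TILE + tid;
        if( idx < 3*n ) {
            stage[k*TILE + tid] = posRaw[idx];
        }
    }
    __syncthreads();

    // Now each thread picks up its own (x,y,z) triple from shared memory
    unsigned int i = blockIdx.x*TILE + tid;
    if( i < n ) {
        posOut[i] = make_float3( stage[3*tid], stage[3*tid + 1], stage[3*tid + 2] );
    }
}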

As you can see, this method gets complicated very fast: the one-line float4 read turns into a ~5-10 line (I'm guessing here) mess of shared-memory management, which is more prone to errors.

As far as efficiency goes: at best you cut the memory traffic by 25% (3 floats instead of 4), but the added cost of the __syncthreads() and the shared memory you tie up could offset that benefit significantly.