I started a small project that calculates electrostatic field lines, and displays them with OpenGL. The purpose of this was to learn CUDA, and get familiar with making visual representations of the results. I am a bit stuck on optimizing the “master” kernel. I’m not as much interested in squeezing max performance of this particular kernel, as I am in getting comfortable with effective optimization techniques. Here’s the code:

[codebox]template<>

struct Vector3

{

```
float x, y, z;
```

};[/codebox]

[codebox]// Allows point charges to be read and written as float4 type

template <>

struct **align**(16) pointCharge

{

```
union
{
float4 ftype;
struct
{
Vector3<float> position;
float magnitude;
};
};
```

};[/codebox]

[codebox]**global** void CalcField_SPkernel(Vector3 *FieldVecs, pointCharge *Charges,

```
unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution)
```

{

```
unsigned int tx = threadIdx.x;
unsigned int ti = blockDim.x * blockIdx.x + tx;
// Shared array to hold the charges
__shared__ pointCharge<float> charge[BLOCK_X];
// previous point ,used to calculate current point, and cumulative field vector
Vector3<float> point, temp = {0, 0, 0};
// Number of iterations of main loop
unsigned int steps = (p + BLOCK_X - 1) / BLOCK_X;
// Load starting point
point = FieldVecs[n * (fieldIndex - 1) + ti];
for(unsigned int step = 0; step < steps; step++)
{
// Load point charges from global memory
unsigned int pId = step * BLOCK_X + tx;
if (pId < p) // This if statement has minimal impact on performance
charge[tx] = Charges[pId];
else
charge[tx].magnitude = 0; // magnitude of 0 should not affect result
// Wait for all loads to complete
__syncthreads();
// Completely unroll the following loop
#pragma unroll
for(unsigned int i = 0; i < BLOCK_X; i++)
{
vec3Addto(temp, electroPartField(charge[i], point) ); // 15FLOP(ElectroPartField) + 3 FLOPs
}
}
// Finally, add the unit vector of the field divided by the resolution to the previous point to get the next point
FieldVecs[n * fieldIndex + ti] = vec3Div(vec3Add(point, vec3Unit(temp) ), resolution);// 15 FLOPs (3 div + 3 add + 9 unit)
```

}[/codebox]

Now, Vector3 is a 12-byte type, and I was not able to coalesce the memory read. I tried loading the array in shared mem by pieces, but the overhead associated with that was too great, and performance poor. I’m currently getting about 136 GFLOP/s (on a 9800GT) with 16384 threads, and it maxes out around147 GFLOP/s at 20480 threads or more. My main concern is getting more performance with these large data sets. I’m not as worrried about “multithreading” individual threads.

Any suggestion is greatly appreciated.

Also, the block size is 64. I’ve tried other values, but this one produces the best performance.