Hi,

I’m developing a CSG evaluation module in CUDA. One part involves calculating the distance from a large number of arbitrary points to a given sphere, as well as the local normal vector at each of these points relative to the sphere: an “embarrassingly parallel” problem.

I have an input array of points (float4 for alignment reasons) and an output array of float4s containing the (normalized) normal vector (x, y, z) and the distance to the sphere (w).

On a MacBook Pro (GeForce 8600M GT, 32 cores @ 0.93 GHz) it takes about 0.09 s to process five million points. On a GTX 285 machine (240 cores @ 1.48 GHz) it takes about 0.06 s for the same setup. Needless to say, I’m a bit disappointed by the speedup.
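To put those timings in perspective, here is a rough sketch of the effective memory throughput they imply. It assumes each point is read once and written once as a 16-byte float4, and that the measured times cover nothing else; `effective_gb_per_s` is a hypothetical helper name, not part of my module.

```cpp
#include <cassert>

// Effective memory throughput (in GB/s) implied by a measured run time.
// Assumption: every point is read once and written once as a 16-byte float4.
double effective_gb_per_s(double num_points, double seconds) {
    assert(seconds > 0.0);
    const double bytes_per_point = 16.0 + 16.0;  // one float4 in, one float4 out
    return num_points * bytes_per_point / seconds / 1e9;
}
```

Under those assumptions, five million points in 0.09 s works out to roughly 1.8 GB/s, and in 0.06 s to roughly 2.7 GB/s.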

Different block sizes don’t make much difference, as long as I stick to multiples of 64 (they seldom do; are they overrated, or is it just me?),

occupancy is 1 in all cases,

no divergent branches,

no non-coalesced reads or writes,

no memcpy overhead, because I’m using page-locked memory (cudaMallocHost()).

With five million points and a block size of (say) 128, I would say I have enough data to keep the GPU busy…
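As a quick check of the launch arithmetic, a minimal sketch assuming a one-dimensional grid with one thread per point (which is how the kernel in the PS indexes its points); `blocks_for` is a hypothetical helper name.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * block_size >= num_points
// (integer ceiling division).
int blocks_for(int num_points, int block_size) {
    assert(num_points >= 0 && block_size > 0);
    return (num_points + block_size - 1) / block_size;
}
```

For five million points and a block size of 128 this gives 39063 blocks, with 64 threads in the last block falling outside the array and being skipped by the bounds check.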

I’m running out of ideas…

Any thoughts?

Kind Regards,

Daniel Dekkers

PS: this is the kernel. I need a pitch and a label because p_CudaResultArray is actually a 2D array; the GPU processes one row at a time.

[codebox]__global__ void CudaCSGSphereKernel_Normal
(const float4* p_CudaPointArray,
 float4* p_CudaResultArray,
 const int p_Size,
 const size_t p_Pitch,
 const unsigned char p_Label,
 const float p_Radius,
 const float3 p_Center)
{
    // One thread per point.
    int l_PointIndex = (blockIdx.x * blockDim.x + threadIdx.x);
    if (l_PointIndex < p_Size)
    {
        float4 l_Result = p_CudaPointArray[l_PointIndex];
        float4* l_Row;
        // Vector from the sphere center to the point.
        l_Result.x -= p_Center.x;
        l_Result.y -= p_Center.y;
        l_Result.z -= p_Center.z;
        // Clamp the length to avoid dividing by zero at the center.
        float l_Length = max(sqrt(l_Result.x*l_Result.x + l_Result.y*l_Result.y + l_Result.z*l_Result.z), 0.00001f);
        // Normalize...
        l_Result.x = l_Result.x/l_Length;
        l_Result.y = l_Result.y/l_Length;
        l_Result.z = l_Result.z/l_Length;
        l_Result.w = p_Radius - l_Length;
        // Select the output row for this label (rows are p_Pitch bytes apart).
        l_Row = (float4*)((char*)p_CudaResultArray + p_Label*p_Pitch);
        l_Row[l_PointIndex] = l_Result;
    }
}[/codebox]
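For checking the kernel's output on the host, a minimal CPU sketch of the same per-point math might look like the following. `Float3`, `Float4` and `sphere_normal_reference` are hypothetical stand-ins, so that it compiles without the CUDA headers; they are not part of my module.

```cpp
#include <algorithm>
#include <cmath>

// Stand-ins for CUDA's float3/float4 (assumption: same field layout).
struct Float3 { float x, y, z; };
struct Float4 { float x, y, z, w; };

// CPU version of the per-point math in the kernel: the unit normal goes in
// x, y, z and (radius - distance to the center) goes in w.
Float4 sphere_normal_reference(Float4 p, Float3 center, float radius) {
    // Vector from the sphere center to the point.
    p.x -= center.x;
    p.y -= center.y;
    p.z -= center.z;
    // Clamp the length to avoid dividing by zero at the center.
    const float length =
        std::max(std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z), 0.00001f);
    p.x /= length;
    p.y /= length;
    p.z /= length;
    p.w = radius - length;
    return p;
}
```

For example, the point (3, 4, 0) with a sphere of radius 2 centered at the origin should give a normal of about (0.6, 0.8, 0) and w = -3, which can be compared against the corresponding entry the GPU writes.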