Hi,
I’m developing a CSG evaluation module in CUDA. One part involves calculating the distances from a large number of arbitrary
points to a given sphere, as well as the local normal vector at each of these points relative to the sphere. An “embarrassingly parallel” problem.
I have an input array of points (float4 for alignment reasons) and an output array of float4s containing the
(normalized) normal vector (x,y,z) and the distance to the sphere (w).
On a MacBook Pro (GeForce 8600M GT, 32 cores @ 0.93 GHz) it takes about 0.09 s to process five million points.
On a GTX285 machine (240 cores @ 1.48 GHz) it takes about 0.06 s for the same setup. Needless to say, I’m
a bit disappointed with the speedup.
Different block sizes don’t make much difference as long as I stick to multiples of 64 (they seldom do; are they overrated, or is it just me?),
occupancy is 1 in all cases,
no divergent branches,
no non-coalesced reads or writes,
no memcpy overhead because I’m using page-locked (pinned) memory (cudaMallocHost()).
With five million points and a block size of (say) 128, I would say I have enough data throughput to keep the GPU busy…
I’m running out of ideas…
Any thoughts?
Kind Regards,
Daniel Dekkers
PS: this is the kernel. I need a pitch and a label because p_CudaResultArray is actually a 2D array; the GPU processes one row at a time.
[codebox]__global__ void CudaCSGSphereKernel_Normal
(const float4* p_CudaPointArray,
 float4* p_CudaResultArray,
 const int p_Size,
 const size_t p_Pitch,
 const unsigned char p_Label,
 const float p_Radius,
 const float3 p_Center)
{
    // One thread per point.
    int l_PointIndex = (blockIdx.x * blockDim.x + threadIdx.x);
    if (l_PointIndex < p_Size)
    {
        float4 l_Result = p_CudaPointArray[l_PointIndex];
        float4* l_Row;
        // Vector from the sphere center to the point.
        l_Result.x -= p_Center.x;
        l_Result.y -= p_Center.y;
        l_Result.z -= p_Center.z;
        // Clamp the length so the divisions below never divide by zero.
        float l_Length = fmaxf(sqrtf(l_Result.x*l_Result.x + l_Result.y*l_Result.y + l_Result.z*l_Result.z), 0.00001f);
        // Normalize...
        l_Result.x = l_Result.x/l_Length;
        l_Result.y = l_Result.y/l_Length;
        l_Result.z = l_Result.z/l_Length;
        // Distance to the sphere surface (positive inside, negative outside).
        l_Result.w = p_Radius - l_Length;
        // Select the row for this label in the pitched 2D result array.
        l_Row = (float4*)((char*)p_CudaResultArray + p_Label*p_Pitch);
        l_Row[l_PointIndex] = l_Result;
    }
}[/codebox]
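For checking the kernel’s output on a handful of points, a host-side reference that mirrors the same arithmetic and the same pitched row addressing can help. This is only a sketch under stated assumptions: `Float3` and `Float4` are stand-ins for CUDA’s `float3` and `float4` so the code compiles without the CUDA headers, and `CsgSphereReference_Normal` is a hypothetical name, not part of the module.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Stand-ins for CUDA's float3/float4 (assumption: same field layout).
struct Float3 { float x, y, z; };
struct Float4 { float x, y, z, w; };

// Hypothetical CPU reference: same math as the kernel, one loop iteration per point.
void CsgSphereReference_Normal(const Float4* p_PointArray,
                               Float4* p_ResultArray,
                               int p_Size,
                               std::size_t p_Pitch,
                               unsigned char p_Label,
                               float p_Radius,
                               Float3 p_Center)
{
    // Row p_Label starts p_Label * p_Pitch bytes into the pitched result array.
    Float4* l_Row = reinterpret_cast<Float4*>(
        reinterpret_cast<char*>(p_ResultArray) + p_Label * p_Pitch);

    for (int l_PointIndex = 0; l_PointIndex < p_Size; ++l_PointIndex)
    {
        Float4 l_Result = p_PointArray[l_PointIndex];
        // Vector from the sphere center to the point.
        l_Result.x -= p_Center.x;
        l_Result.y -= p_Center.y;
        l_Result.z -= p_Center.z;
        // Clamp the length so the divisions below never divide by zero.
        const float l_Length = std::max(
            std::sqrt(l_Result.x * l_Result.x +
                      l_Result.y * l_Result.y +
                      l_Result.z * l_Result.z),
            0.00001f);
        // Normalize and store radius minus length in w, as the kernel does.
        l_Result.x /= l_Length;
        l_Result.y /= l_Length;
        l_Result.z /= l_Length;
        l_Result.w = p_Radius - l_Length;
        l_Row[l_PointIndex] = l_Result;
    }
}
```

Comparing a few rows of the GPU result against this reference (within a small floating-point tolerance) is one way to confirm that the pitch and label arithmetic write to the intended row.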