I have a kernel as follow: (KDCONSTRUCT_THREADS == 128)

I need to be 1.0 capable (no atomic add on shared memory ;((

Is this code optimal ? or is there any way to make it better ?

BTW. There is a guarantee that number of triangles that is inside the box is < TRI_CACHE_SIZE.

__global__ void KDSomething(const unsigned int FirstTrianglesIndex, const unsigned int LastTrianglesIndex, const float3 bbmin, const float3 bbmax)
{
__shared__ unsigned int TrianglesCache[TRI_CACHE_SIZE];
__shared__ unsigned int TrinaglesCacheCnt;
if (threadIdx.x == 0)
Â Â TrinaglesCacheCnt = 0;
__syncthreads();
for (unsigned int i = FirstTrianglesIndex; i < LastTrianglesIndex; i+=KDCONSTRUCT_THREADS)
Â {
Â Â const unsigned int idx Â = i + threadIdx.x;
Â Â Â Â Â unsigned int HIdx = 0xFFFFFFFF;
Â Â if (idx < gNumberOfTriangles)
Â Â {
Â Â float4 tmin = tex1Dfetch(PolygonsMinMax, (idx<<1)+0);
Â Â float4 tmax = tex1Dfetch(PolygonsMinMax, (idx<<1)+1);
Â Â if (tmax.x >= bbmin.x && tmax.y >= bbmin.y && tmax.z >= bbmin.z &&
Â Â Â tmin.x <= bbmax.x && tmin.y <= bbmax.y && tmin.z <= bbmax.z Â )
Â Â {
Â Â Â HIdx = idx;
Â Â }
Â Â }
Â Â Â Â Â Â Â Â Â Â Â Â //
Â Â Â Â Â Â Â Â Â Â Â Â // is there any construct that's better replaces this:
Â Â Â Â Â Â Â Â Â Â Â Â //
Â Â Â Â Â Â Â Â Â Â Â Â // if (HIdx != 0xFFFFFFFF) TrianglesCache[atomicAdd(&TrinaglesCacheCnt, 1)] = HIdx;
Â Â Â Â Â Â Â Â Â Â Â Â // ??
Â Â __syncthreads();
Â Â for (int k = 0; k < KDCONSTRUCT_THREADS; k++)
Â Â {
Â Â if (k == threadIdx.x && HIdx != 0xFFFFFFFF)
Â Â Â TrianglesCache[TrinaglesCacheCnt++] = HIdx;
Â Â __syncthreads();
Â Â }
Â }
//
// Rest of kernel that uses indices from TrianglesCache follows for other calculations follows
//
}

… i i’v understand corectly, something like this should be safe in terms of atomic writes ?

and collision on ‘TrianglesCacheCnt’ should never happend ?

__global__ void KDSomething(const unsigned int FirstTrianglesIndex, const unsigned int LastTrianglesIndex, const float3 bbmin, const float3 bbmax)
{
__shared__ unsigned int TrianglesCache[TRI_CACHE_SIZE];
__shared__ unsigned int TrianglesCacheCnt;
__shared__ unsigned int PerThreadIDX[KDCONSTRUCT_THREADS];
if (threadIdx.x == 0)
TrianglesCacheCnt = 0;
__syncthreads();
for (unsigned int i = FirstTrianglesIndex; i < LastTrianglesIndex; i+=KDCONSTRUCT_THREADS)
{
const unsigned int idx = i + threadIdx.x;
PerThreadIDX[threadIdx.x] = 0xFFFFFFFF;
if (idx < gNumberOfTriangles)
{
float4 tmin = tex1Dfetch(PolygonsMinMax, (idx<<1)+0);
float4 tmax = tex1Dfetch(PolygonsMinMax, (idx<<1)+1);
if (tmax.x >= bbmin.x && tmax.y >= bbmin.y && tmax.z >= bbmin.z &&
tmin.x <= bbmax.x && tmin.y <= bbmax.y && tmin.z <= bbmax.z )
{
PerThreadIDX[threadIdx.x]=idx;
}
}
__syncthreads();
//
// Let's first thread in each warp write indices from all threads in this warp
//
if ((threadIdx.x & 0x1F) == 0)
{
for (int k = 0; k < 32; k++)
{
unsigned int HIdx = PerThreadIDX[threadIdx.x+k];
if (HIdx != 0xFFFFFFFF)
TrianglesCache[TrianglesCacheCnt++] = HIdx
}
}
//
// Rest of kernel that uses indices from TrianglesCache follows for other calculations follows
//
}

if (threadIdx.x == 0)
Â Â TrianglesCacheCnt = 0;
__syncthreads();

the syncthreads here is not necessary either, and there is also no point in initializing TrianglesCacheCnt so early.

Â __syncthreads();
Â
Â //
Â // Let's first thread in each warp write indices from all threads in this warp
Â //
Â if ((threadIdx.x & 0x1F) == 0)
Â {
Â for (int k = 0; k < 32; k++)
Â {
Â unsigned int HIdx = PerThreadIDX[threadIdx.x+k];
Â if (HIdx != 0xFFFFFFFF)
Â Â TrianglesCache[TrianglesCacheCnt++] = HIdx
Â }
Â }
Â
//
// Rest of kernel that uses indices from TrianglesCache follows for other calculations follows
//
}

“if ((threadIdx.x & 0x1F) == 0)” is true for each thread index divisible by 32,

which is the first thread of each warp. You want only the first warp to run, which

is e.g. “if ((threadIdx.x & ~0x1F) == 0)” or “if ((threadIdx.x >> 5) == 0)”.

Also either all threads should continue to execute the code, then you need one more syncthreads after generating the cache or if one warp alone is enough for the remainder do

if (threadIdx.x >> 5) return;

Note that the way you did it now you assume a warp size of 32, which might not be true for future devices. Assuming that warp size will stay a power of two is probably resonable, so I’d do something like this:

so if i understand your post corectly I’v end up with code below.

i need 128 or 256 threads in block for other computations later in this kernel.

so my loop looks like this now (pseudocode):

shared TriangleCache[];
#if DEVICE_IS_1.2_CAPABLE
shared Count = 0;
for (i = 0; i < NumberOfTriangles; i+=256)
{
if (TriangleIsInsideBox(i+ThreadID))
TriangleCache[atomicAdd(&Count, 1)] = i+ThreadID;
}
#else
shared CurrentIdx[];
CacheCount = 0;
for (i = 0; i < NumberOfTriangles; i+=256)
{
CurrentIdx[ThreadID] = -1;
if (TriangleIsInsideBox(i+ThreadID))
CurrentIdx[ThreadID] = i+ThreadID;
__syncthreads(); // wait for all threads to fill CurrentIdx
if (ThreadID == 0)
{
// only first thread in first warp writes all (possibly 256) values into cache
for (int t = 0; t < 256; t++)
{
if (CurrentIdx[t] != -1)
TriangleCache[CacheCount++] = CurrentIdx[t];
}
}
__syncthreads(); // wait for thread0 to finish their work
}
#endif

i’v tested this code on 1.2 device, both works fine, bot the performance of kernel with second block is very very poor compared to the first one, is there any other way to improve this ?

Well, you noted correctly that it is nonsense to use a full warp for setting the cache, it probably will not get any faster (I was thinking too general, not about your special case).

Also CacheCount of course must be shared. I do not know if this applies to the real code, but the way your pseudocode looks, you should move the last syncthreads outside the for loop: that will be faster and it will also work when NumberOfTriangles is 0 and you have some code after the loop.

Why CacheCount must be shared ? it is really used only by thread0

in real kernel i know in advance what number of triangles is in the box, i just dont know witch ones.

Why the second __syncthreads() should be outside loop ? IMHO the one at the end of ‘for’ block must remain since there may be situation when thread0 still reads from CurrentIdx and other threads that do not process this block writes some values to CurrentIdx destroying previous ones.

Or this situation is completlly impossible due to divergences in execution ?

thread0 is in warp0, but thread32, 33 … are in warp1, divergent code is serialized only inside warps right ? so thread32 may freelly execute next ‘for’ iteration without sync at end of block ?