=====================================================================EDIT=======================================================================
Already fixed it. It was just an wrong comparision on updateFront kernel. Code edited also to reflect changes.
================================================================================================================================================
First some background. I’m trying to track the front of a BVTT using CUDA code. In the begin, my system has a 3D model object which has a Bounding Volume Hierarchy (BVH), a tree. For the first frame a full traversal (from the root node) is made on it. This traversal is done by setting work queues with pairs (i,j) where i and j are indices to BVH nodes. The collision is evaluated per pair while traversing the tree by a CUDA kernel. While the traversal is being done, the front of the BVTT is computed by adding nodes (i,j) to a front array if i and j are indices to BVH nodes that the traversal advances no further from them. i and j can be internal BVH nodes which collision was evaluated to false, they can be leaf nodes or a combination of leaf node and internal node. For all next frames the front is used as a starting point for the traversal, trying to use temporal coherence. For the next frame a front update is needed, which controls how the front feeds the work queues for next traversal. More details about front tracking on http://gamma.cs.unc.edu/PCD/spm09.pdf (short paper) and http://gamma.cs.unc.edu/PCD/pccd.pdf (full paper). Detail abouts the work queues and traversal can be found in http://gamma.cs.unc.edu/GPUCOL/gProximity.pdf .
Now to my system itself. In its previous stable version, the front update is being done by just copying the entire front to work queues and running the traversal kernel to evaluate work queues and add nodes to the front. This approach wastes some resources because not the entire front array have useful nodes. Now my efforts are to do a more optimized version of the front update, running a kernel before traversal which evaluates only useful nodes, feeding work queues with them and even feeding an output triangle pair array if BVH leaf nodes are found in front array. I’ve already wrote this optimized code but some bugs were introduced also that are driving me mad to find!
Now to code. First we have the CPU function CUDA_BVHCollide which manages collision and calls front update and traversal kernels. Both kernels updateFront and traverseTree are inside a loop and have some code after to balance buffers since buffers are segmented by blocks. These segments can be full and the balancing evenly distributes elements on them to avoid overflows. CUDA_balanceFront is the CPU function that manages balancing for all buffers, calling balanceFront kernel.
ColInfoGroup* CUDA_BVHCollide(ModelInstance* model1)
{
OBBNode* obbTree1 = (OBBNode*)model1->obbTree;
SceneVertexList h_vertices = model1->verlistFinal;
GPUVertex* d_vertices1 = (GPUVertex*)model1->vertexPointer;
GPUVertex* d_vertices2 = (GPUVertex*)model1->vertexPointerFinal;
GPUVertex* d_deltaPos = (GPUVertex*)model1->deltaPosPointer;
GPUEdge* d_edges = (GPUEdge*)model1->edgePointer;
GPUTriEdge* d_triEdges = (GPUTriEdge*)model1->triEdgesPointer;
uint3* d_triIndices1 = (uint3*)model1->triIdxPointer;
unsigned int* d_collisionPairIndex = NULL; //Index of the pair inside the collision queue
unsigned int* d_collisionSync = NULL;
unsigned int* d_puiIdleFrontCount = NULL;
unsigned int* d_nWorkQueueElements = NULL;
int2* d_collisionPairs = NULL;
int2* d_workQueues = NULL; //All the work queues in a one big array. The two indices
int2* d_workQueues2 = NULL; //indicates the BV pair
unsigned int* d_workQueueCounts = NULL; //Each index has the number of elements of a work queue (per block)
int* d_balanceSignal = NULL;
// allocate collision list (try to be conservative)
unsigned int collisionPairCapacity = COLLISION_PAIR_CAPACITY;
GPUMALLOC((void**)&d_collisionPairs, sizeof(int2) * collisionPairCapacity);
GPUMALLOC((void **)&d_collisionPairIndex, sizeof(int));
GPUMALLOC((void **)&d_nWorkQueueElements, sizeof(int));
GPUMALLOC((void **)&d_collisionSync, sizeof(int));
GPUMALLOC(&d_puiIdleFrontCount , sizeof(int));
// allocate work queues
GPUMALLOC((void **)&d_workQueues, sizeof(int2)*QUEUE_NTASKS*QUEUE_SIZE_PER_TASK_GLOBAL);
GPUMALLOC((void **)&d_workQueues2, sizeof(int2)*QUEUE_NTASKS*QUEUE_SIZE_PER_TASK_GLOBAL);
GPUMALLOC((void **)&d_workQueueCounts, sizeof(int)*QUEUE_NTASKS);
GPUMEMSET(d_workQueueCounts, 0, sizeof(int) * QUEUE_NTASKS);
GPUMEMSET(d_collisionPairIndex, 0, sizeof(int));
GPUMALLOC((void**)&d_balanceSignal, sizeof(int));
int2* d_aFront = (int2*) model1->frontPointer;
unsigned int* d_auiFrontCounts = (unsigned int*) model1->frontCountsPointer;
int2* d_aOutputFront;
GPUMALLOC(&d_aOutputFront , sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE);
bool bFrontUpdateNeeded = d_aFront ? true : false;
if(!bFrontUpdateNeeded)
{ //Full traversal needed. Allocate and set buffers properly
unsigned int uiFrontCountsSize = sizeof(unsigned int)*QUEUE_NTASKS;
unsigned int uiFrontSize = sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE;
GPUMALLOC(&d_aFront , uiFrontSize);
GPUMALLOC(&d_auiFrontCounts , uiFrontCountsSize);
GPUMEMSET(d_auiFrontCounts , 0 , uiFrontCountsSize);
// init first work element:
GPUMEMSET(d_workQueues, 0, sizeof(int2));
unsigned int firstCount = 1;
TOGPU(d_workQueueCounts, &firstCount, sizeof(unsigned int));
}
else
{ //Update front
/*//Uncomment this code to update the front by just copying front to work queues
GPUTOGPU(d_workQueues , d_aFront , sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE);
GPUTOGPU(d_workQueueCounts , d_auiFrontCounts , sizeof(unsigned int)*QUEUE_NTASKS)
GPUMEMSET(d_auiFrontCounts , 0 , sizeof(unsigned int)*QUEUE_NTASKS);*/
//Uncomment this code the other front update
unsigned int *d_auiOutputFrontIdx;
unsigned int uiFrontCountsSize = sizeof(unsigned int)*QUEUE_NTASKS;
GPUMALLOC(&d_auiOutputFrontIdx , uiFrontCountsSize);
GPUMEMSET(d_auiOutputFrontIdx , 0 , uiFrontCountsSize);
int2 *d_aOutputFront2;
GPUMALLOC(&d_aOutputFront2 , sizeof(int2)*FRONT_GLOBAL_SIZE*QUEUE_NTASKS);
unsigned int *d_puiIdleInputFront;
GPUMALLOC(&d_puiIdleInputFront , sizeof(unsigned int));
//CUDA_updateFront<OBBNode , OBB>(&d_aFront , &d_aOutputFront , &d_aOutputFront2 , d_auiFrontCounts , d_auiOutputFrontIdx ,
//&d_workQueues , &d_workQueues2 , d_workQueueCounts , d_collisionPairs , d_collisionPairIndex , obbTree1 , d_collisionSync ,
//d_puiIdleInputFront , d_puiIdleFrontCount , d_triIndices1);
//starting CUDA_updateFront
//Debug
int2 *h_aOutputFront = (int2*)malloc(sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE);
unsigned int h_auiOutputFrontIdx[QUEUE_NTASKS];
//
while(true)
{
GPUMEMSET(d_collisionSync, 0, sizeof(int));
GPUTOGPU(d_puiIdleInputFront , d_collisionSync , sizeof(unsigned int));
GPUTOGPU(d_puiIdleFrontCount , d_collisionSync , sizeof(unsigned int));
updateFront<OBBNode , OBB , QUEUE_SIZE_PER_TASK_GLOBAL , FRONT_GLOBAL_SIZE><<<QUEUE_NTASKS, FRONT_UPDATE_THREADS>>>
(d_aFront , d_aOutputFront , d_auiFrontCounts , d_auiOutputFrontIdx , d_workQueues , d_workQueueCounts ,
d_collisionPairs , d_collisionPairIndex , obbTree1 , d_collisionSync , d_puiIdleInputFront ,d_puiIdleFrontCount ,
d_triIndices1);
unsigned int h_puiIdleInputFrontCount;
FROMGPU(&h_puiIdleInputFrontCount , d_puiIdleInputFront , sizeof(unsigned int));
if(h_puiIdleInputFrontCount == QUEUE_NTASKS)
{
//All nodes from input front segments were processed.
break;
}
unsigned int h_puiIdleWorkQueuesCount;
FROMGPU(&h_puiIdleWorkQueuesCount , d_collisionSync , sizeof(unsigned int));
if(h_puiIdleWorkQueuesCount > QUEUE_IDLETASKS_FOR_ABORT ||
(h_puiIdleInputFrontCount > QUEUE_IDLETASKS_FOR_ABORT && h_puiIdleWorkQueuesCount != 0))
{
cout << "update front: work queues balancing..." << endl;
#ifdef PROFILE_DEBUG
startTimer(&timer);
#endif
CUDA_balanceFront(d_workQueues , d_workQueues2 , d_workQueueCounts);
int2* pTempWorkQueues = d_workQueues2;
d_workQueues2 = d_workQueues;
d_workQueues = pTempWorkQueues;
#ifdef PROFILE_DEBUG
endTimer("balanceWorkList" , &timer);
#endif
}
unsigned int h_puiIdleOutputFrontCount;
FROMGPU(&h_puiIdleOutputFrontCount , d_puiIdleFrontCount , sizeof(unsigned int));
if(h_puiIdleOutputFrontCount > QUEUE_IDLETASKS_FOR_ABORT ||
(h_puiIdleInputFrontCount > QUEUE_IDLETASKS_FOR_ABORT && h_puiIdleOutputFrontCount != 0))
{
cout << "update front: output front balancing..." << endl;
#ifdef PROFILE_DEBUG
startTimer(&timer);
#endif
//Debug
dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aOutputFront , d_aOutputFront2 , d_auiOutputFrontIdx);
//
CUDA_balanceFront(d_aOutputFront , d_aOutputFront2 , d_auiOutputFrontIdx);
int2* pTempFront = d_aOutputFront2;
d_aOutputFront2 = d_aOutputFront;
d_aOutputFront = pTempFront;
//Debug
dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aOutputFront , d_aOutputFront2 , d_auiOutputFrontIdx);
//
#ifdef PROFILE_DEBUG
endTimer("balanceFront" , &timer);
#endif
}
}
//Debug
safeFree(h_aOutputFront);
//end CUDA_updateFront
GPUFREE(d_aOutputFront2);
GPUFREE(d_auiFrontCounts);
int2 *d_aTempFront = d_aOutputFront;
d_aOutputFront = d_aFront;
d_aFront = d_aTempFront;
d_auiFrontCounts = d_auiOutputFrontIdx;
}
bool bActiveSplits = true;
unsigned int nVertices = model1->nVerts;
unsigned int nPairs = 0;
bool bFrontBalance = false;
#ifdef PROFILE_DEBUG
TimerValue timer;
timer.start();
#endif
while(bActiveSplits)
{
GPUMEMSET(d_collisionSync, 0, sizeof(int));
GPUTOGPU(d_puiIdleFrontCount , d_collisionSync , sizeof(int));
#ifdef PROFILE_DEBUG
unsigned int timer = 0;
startTimer(&timer);
#endif
traverseTree<OBBNode, OBB, QUEUE_SIZE_PER_TASK_GLOBAL, QUEUE_SIZE_PER_TASK_INIT, TRAVERSAL_THREADS , FRONT_GLOBAL_SIZE>
<<< QUEUE_NTASKS, TRAVERSAL_THREADS>>>
(obbTree1, d_vertices1, d_triIndices1, d_workQueues, d_workQueueCounts, d_collisionSync,
QUEUE_SIZE_PER_TASK_GLOBAL, d_collisionPairs, d_collisionPairIndex , d_aFront , d_auiFrontCounts , d_puiIdleFrontCount);
cudaThreadSynchronize();
//dummyKernel<int2 , unsigned int , int , int , int><<<1,32>>>(d_aFront , d_auiFrontCounts);
#ifdef PROFILE_DEBUG
endTimer("traverseTree" , &timer);
#endif
CUT_CHECK_ERROR("Traversal error\n");
unsigned int h_collisionSync;
FROMGPU(&h_collisionSync , d_collisionSync , sizeof(unsigned int));
if(h_collisionSync > QUEUE_IDLETASKS_FOR_ABORT)
{
#ifdef PROFILE_DEBUG
startTimer(&timer);
#endif
//balanceWorkList<BALANCE_THREADS, QUEUE_NTASKS, int2> <<< 1, BALANCE_THREADS>>>(d_workQueues, d_workQueues2, d_workQueueCounts, QUEUE_SIZE_PER_TASK_GLOBAL, d_nWorkQueueElements, d_balanceSignal);
//cudaThreadSynchronize();
unsigned int *d_puiTotalNodes = CUDA_balanceFront(d_workQueues , d_workQueues2 , d_workQueueCounts);
int2* pTempWorkQueues = d_workQueues2;
d_workQueues2 = d_workQueues;
d_workQueues = pTempWorkQueues;
#ifdef PROFILE_DEBUG
endTimer("balanceWorkList" , &timer);
#endif
unsigned int h_uiTotalNodes;
FROMGPU(&h_uiTotalNodes , d_puiTotalNodes , sizeof(unsigned int));
if(h_uiTotalNodes == 0)
{
bActiveSplits = false;
break;
}
}
unsigned int h_uiIdleFrontCount;
FROMGPU(&h_uiIdleFrontCount , d_puiIdleFrontCount , sizeof(unsigned int));
if(h_uiIdleFrontCount > QUEUE_IDLETASKS_FOR_ABORT)
{
#ifdef PROFILE_DEBUG
startTimer(&timer);
#endif
dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aFront , d_aOutputFront , d_auiFrontCounts);
CUDA_balanceFront(d_aFront , d_aOutputFront , d_auiFrontCounts);
dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aFront , d_aOutputFront , d_auiFrontCounts);
int2* pTempFront = d_aOutputFront;
d_aOutputFront = d_aFront;
d_aFront = pTempFront;
#ifdef PROFILE_DEBUG
endTimer("balanceFront" , &timer);
#endif
}
#ifdef TRAVERSAL_DEBUG
showTraversalDebugInfo(d_workQueues2, d_workQueueCounts, d_nWorkQueueElements, d_collisionSync, d_collisionPairs, d_collisionPairIndex);
//showBalanceDebugInfo(d_workQueues, d_workQueues2, d_workQueueCounts, QUEUE_SIZE_PER_TASK_GLOBAL, d_nWorkQueueElements, d_balanceSignal);
#endif
}
model1->frontPointer = d_aFront;
model1->frontCountsPointer = d_auiFrontCounts;
#ifdef PROFILE_DEBUG
timer.end("Traversal + Balancing");
g_fTotalTraversalTime += timer.getElapsedMs();
cout << "average traversal time = " << g_fTotalTraversalTime/(++g_uiNFrames) << endl;
#endif
FROMGPU(&nPairs, d_collisionPairIndex, sizeof(int));
//Free memory for triangle pair collision query
GPUFREE(d_workQueueCounts);
GPUFREE(d_workQueues);
GPUFREE(d_workQueues2);
GPUFREE(d_collisionPairIndex);
GPUFREE(d_nWorkQueueElements);
GPUFREE(d_collisionSync);
GPUFREE(d_puiIdleFrontCount);
GPUFREE(d_balanceSignal);
GPUFREE(d_aOutputFront);
//... we can don't care with code after this point
}
Now we have CUDA_BVHCollide associated kernels. First, updateFront and auxiliary functions.
template<typename TreeNode , typename BV , unsigned int uiGlobalWorkQueueCapacity , unsigned int uiGlobalFrontCapacity>
void __global__ updateFront(int2 *inputFront , int2 *outputFront , unsigned int *auiInputFrontIdx , unsigned int *auiOutputFrontIdx ,
int2* work_queues , unsigned int* auiWork_queue_counts , int2 *auiCollisionPairs , unsigned int *uiCollisionPairsIdx ,
const TreeNode* tree , unsigned int *puiIdleCount , unsigned int *puiIdleInputFront , unsigned int *puiIdleFrontCount ,
uint3 *aTriIndices)
{
const unsigned int uiBlockOffset = blockIdx.x;
const unsigned int uiThreadOffset = threadIdx.x;
__shared__ unsigned int uiInputFrontIdx;
__shared__ unsigned int uiOutputFrontIdx;
__shared__ unsigned int uiWorkQueueIdx;
__shared__ unsigned int uiNThreads;
__shared__ unsigned int wantToAbort;
__shared__ unsigned int uiWantToAbortFront;
if(uiThreadOffset == 0)
{
uiInputFrontIdx = auiInputFrontIdx[uiBlockOffset];
uiOutputFrontIdx = auiOutputFrontIdx[uiBlockOffset];
uiWorkQueueIdx = auiWork_queue_counts[uiBlockOffset];
uiNThreads = blockDim.x;
}
__syncthreads();
if(uiInputFrontIdx == 0)
{
callAbort<QUEUE_NTASKS>(puiIdleInputFront , uiThreadOffset);
return;
}
__syncthreads();
while(uiInputFrontIdx > 0)
{
int2 bvttNode;
int nActive = min(uiNThreads, uiInputFrontIdx);
bvttNode.x = -1;
if(uiThreadOffset < uiInputFrontIdx)
{
bvttNode = inputFront[uiBlockOffset*FRONT_GLOBAL_SIZE + uiInputFrontIdx - nActive + uiThreadOffset];
}
__syncthreads();
if(uiThreadOffset == 0)
{
uiInputFrontIdx -= nActive;
}
__syncthreads();
if(bvttNode.x != -1)
{
TreeNode node0 = tree[bvttNode.x];
TreeNode node1 = tree[bvttNode.y];
if(node0.isLeaf() && node1.isLeaf())
{
int iTri0 = node0.getTriID();
int iTri1 = node1.getTriID();
uint3 idx0 = aTriIndices[iTri0];
uint3 idx1 = aTriIndices[iTri1];
//Add just the non adjacent triangles
if( idx0.x != idx1.x && idx0.x != idx1.y && idx0.x != idx1.z &&
idx0.y != idx1.x && idx0.y != idx1.y && idx0.y != idx1.z &&
idx0.z != idx1.x && idx0.z != idx1.y && idx0.z != idx1.z)
{
if(intersect<BV>(node0.bbox, node1.bbox))
{
unsigned int uiColIdx = atomicAdd(uiCollisionPairsIdx , 1);
int2 pair;
pair.x = iTri0;
pair.y = iTri1;
auiCollisionPairs[uiColIdx] = pair;
}
addToFront(bvttNode , outputFront , uiOutputFrontIdx , uiBlockOffset , uiGlobalFrontCapacity);
}
}
else if(node1.isLeaf() || (!node0.isLeaf() && (node0.bbox.getSize() > node1.bbox.getSize())))
{ // node0 is bigger. Subdivide it.
if(intersect<BV>(node0.bbox, node1.bbox))
{
int2 workItem;
workItem.x = bvttNode.y;
workItem.y = bvttNode.x + node0.getLeftChild();
addToWorkQueue(uiWorkQueueIdx , work_queues , workItem , uiGlobalWorkQueueCapacity , uiBlockOffset);
}
else
{
addToFront(bvttNode , outputFront , uiOutputFrontIdx , uiBlockOffset , uiGlobalFrontCapacity);
}
}
else
{ // node1 is bigger. Subdivide it.
if(intersect<BV>(node0.bbox, node1.bbox))
{
int2 workItem;
workItem.x = bvttNode.x;
workItem.y = bvttNode.y + node1.getLeftChild();
addToWorkQueue(uiWorkQueueIdx , work_queues , workItem , uiGlobalWorkQueueCapacity , uiBlockOffset);
}
else
{
addToFront(bvttNode , outputFront , uiOutputFrontIdx , uiBlockOffset , uiGlobalFrontCapacity);
}
}
}
__syncthreads();
if(uiInputFrontIdx == 0)
{
callAbort<QUEUE_NTASKS>(puiIdleInputFront , uiThreadOffset);
break;
}
if((uiWorkQueueIdx >= QUEUE_SIZE_PER_TASK_GLOBAL - uiNThreads * 2))
{
callAbort<QUEUE_NTASKS>(puiIdleCount, uiThreadOffset);
break;
}
if(uiOutputFrontIdx > uiGlobalFrontCapacity - uiNThreads * 2)
{
callAbort<QUEUE_NTASKS>(puiIdleFrontCount , uiThreadOffset);
break;
}
if(uiThreadOffset == 0)
{
wantToAbort = *puiIdleCount;
uiWantToAbortFront = *puiIdleFrontCount;
}
__syncthreads();
if(wantToAbort > QUEUE_IDLETASKS_FOR_ABORT)
{
callAbort<QUEUE_NTASKS>(puiIdleCount, uiThreadOffset);
break;
}
if(uiWantToAbortFront > QUEUE_IDLETASKS_FOR_ABORT)
{
callAbort<QUEUE_NTASKS>(puiIdleFrontCount, uiThreadOffset);
break;
}
}
if(uiThreadOffset == 0)
{
auiOutputFrontIdx[uiBlockOffset] = uiOutputFrontIdx;
auiInputFrontIdx[uiBlockOffset] = uiInputFrontIdx;
auiWork_queue_counts[uiBlockOffset] = uiWorkQueueIdx;
}
}
template <int nTotalProcessors>
static __device__ __inline__ void callAbort(unsigned int *workQueueCounter, const int threadID)
{
if(threadID == 0)
atomicInc(workQueueCounter, nTotalProcessors);
}
__device__ __inline__ void addToFront(const int2 &work_item , int2 *aGlobalFront , unsigned int &uiGlobalFrontIdx ,
const unsigned int &uiBlockOffset , const unsigned int &iGlobalFrontCapacity)
{
unsigned int uiGlobalIdx = atomicAdd(&uiGlobalFrontIdx , 1);
aGlobalFront[uiBlockOffset * iGlobalFrontCapacity + uiGlobalIdx] = work_item;
}
void __device__ __inline__ addToWorkQueue(unsigned int &uiWorkQueueIdx , int2* work_queues , int2 &workItem ,
const unsigned int uiGlobalWorkQueueCapacity , const unsigned int &uiBlockOffset)
{
unsigned int uiWorkIdx = atomicAdd(&uiWorkQueueIdx , 2);
unsigned int uiWorkItemOffset = uiBlockOffset*uiGlobalWorkQueueCapacity + uiWorkIdx;
work_queues[uiWorkItemOffset] = workItem;
++workItem.y;
work_queues[uiWorkItemOffset + 1] = workItem;
}
Now we have the CPU function CUDA_balanceFront and associated kernels.
//Do front balancing
inline
unsigned int* CUDA_balanceFront(int2 *d_aFront , int2 *d_aOutputFront , unsigned int *d_auiFrontCounts)
{
//Prefix sum the front counters
CUDPPHandle theCudpp;
cudppCreate(&theCudpp);
CUDPPConfiguration config;
config.op = CUDPP_ADD;
config.datatype = CUDPP_UINT;
config.algorithm = CUDPP_SCAN;
config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;
CUDPPHandle scanPlan = 0;
CUDPP_SAFE_CALL(cudppPlan(theCudpp, &scanPlan, config, QUEUE_NTASKS, 1, 0));
unsigned int uiFrontPrefixSumSize = sizeof(unsigned int)*QUEUE_NTASKS;
unsigned int *d_auiFrontPrefixSum;
GPUMALLOC(&d_auiFrontPrefixSum , uiFrontPrefixSumSize);
CUDPP_SAFE_CALL(cudppScan(scanPlan , d_auiFrontPrefixSum , d_auiFrontCounts , QUEUE_NTASKS));
unsigned int *d_puiTotalSum;
GPUMALLOC(&d_puiTotalSum , sizeof(unsigned int));
GPUTOGPU(d_puiTotalSum , d_auiFrontPrefixSum + QUEUE_NTASKS - 1, sizeof(unsigned int));
unsigned int *d_puiLastCount;
GPUMALLOC(&d_puiLastCount , sizeof(unsigned int));
GPUTOGPU(d_puiLastCount , d_auiFrontCounts + QUEUE_NTASKS - 1, sizeof(unsigned int));
unsigned int *d_auiTempFrontCounts;
GPUMALLOC(&d_auiTempFrontCounts , uiFrontPrefixSumSize);
GPUTOGPU(d_auiTempFrontCounts , d_auiFrontCounts , uiFrontPrefixSumSize);
unsigned int *d_puiNSegmentsWithMoreNodes;
GPUMALLOC(&d_puiNSegmentsWithMoreNodes , sizeof(unsigned int));
unsigned int *d_puiNodesPerSegment;
GPUMALLOC(&d_puiNodesPerSegment , sizeof(unsigned int));
recomputeCounters<<<1 , QUEUE_NTASKS>>>(d_auiFrontCounts , d_puiTotalSum , d_puiLastCount , d_puiNSegmentsWithMoreNodes ,
d_puiNodesPerSegment);
CUT_CHECK_ERROR("recomputeCounters error.\n");
balanceFront<<<QUEUE_NTASKS , BALANCE_THREADS>>>(d_aFront , d_aOutputFront , d_auiTempFrontCounts , d_auiFrontPrefixSum ,
d_puiNodesPerSegment , d_puiNSegmentsWithMoreNodes);
CUT_CHECK_ERROR("balanceFront error.\n");
CUDPP_SAFE_CALL(cudppDestroyPlan(scanPlan));
cudppDestroy(theCudpp);
GPUFREE(d_puiNSegmentsWithMoreNodes);
GPUFREE(d_puiNodesPerSegment);
GPUFREE(d_auiTempFrontCounts);
GPUFREE(d_auiFrontPrefixSum);
return d_puiTotalSum;
}
void __global__ recomputeCounters(unsigned int *auiFrontCounts , unsigned int *uiTotalSum , const unsigned int *uiLastCount ,
unsigned int *puiNSegmentsWithMoreNodes , unsigned int *puiNodesPerSegment)
{
__shared__ unsigned int uiTotalNodes;
__shared__ unsigned int uiNodesPerSegment;
__shared__ unsigned int uiRemainder;
unsigned int uiThreadOffset = threadIdx.x;
if(uiThreadOffset == 0)
{
uiTotalNodes = *uiTotalSum + *uiLastCount;
uiNodesPerSegment = uiTotalNodes/QUEUE_NTASKS;
uiRemainder = uiTotalNodes % QUEUE_NTASKS;
}
__syncthreads();
if(uiThreadOffset < uiRemainder)
{
auiFrontCounts[uiThreadOffset] = uiNodesPerSegment + 1;
}
else
{
auiFrontCounts[uiThreadOffset] = uiNodesPerSegment;
}
if(uiThreadOffset == 0)
{
*puiNSegmentsWithMoreNodes = uiRemainder;
*puiNodesPerSegment = uiNodesPerSegment + 1;
*uiTotalSum = uiTotalNodes;
}
}
void __global__ balanceFront(const int2 *inputFront , int2 *outputFront , const unsigned int *auiFrontCounts ,
const unsigned int *auiPrefixSum , const unsigned int *puiNodesPerSegment , const unsigned int *puiNSegmentsWithMoreNodes)
{
unsigned int uiBlockOffset = blockIdx.x;
unsigned int uiThreadOffset = threadIdx.x;
unsigned int uiNThreads = blockDim.x;
__shared__ unsigned int uiFrontCount;
__shared__ unsigned int uiCurrPrefixSum;
__shared__ unsigned int uiNSegmentsWithMoreNodes;
__shared__ unsigned int uiSegsWithMoreNodesTimesNodesPerSeg;
__shared__ unsigned int uiNNodesPerSegment;
__shared__ unsigned int uiNNodesPerSegmentMinus1;
if(uiThreadOffset == 0)
{
uiFrontCount = auiFrontCounts[uiBlockOffset];
uiNSegmentsWithMoreNodes = *puiNSegmentsWithMoreNodes;
uiNNodesPerSegment = *puiNodesPerSegment;
uiNNodesPerSegmentMinus1 = uiNNodesPerSegment - 1;
uiSegsWithMoreNodesTimesNodesPerSeg = uiNSegmentsWithMoreNodes*uiNNodesPerSegment;
uiCurrPrefixSum = auiPrefixSum[uiBlockOffset];
/*
if(uiBlockOffset < uiNSegmentsWithMoreNodes)
{ //Consider just the segments that should have more elements
uiCurrPrefixSum = uiBlockOffset*uiNNodesPerSegment;
}
else
{ //Consider segments that should have less elements too
uiCurrPrefixSum = uiSegsWithMoreNodesTimesNodesPerSeg +
(uiBlockOffset - uiNSegmentsWithMoreNodes)*(uiNNodesPerSegmentMinus1);
}*/
}
__syncthreads();
int2 bvttNode;
unsigned int uiCurrSumPlusThread = uiCurrPrefixSum + uiThreadOffset;
unsigned int uiTargetSegmentIdx = uiCurrSumPlusThread / uiNNodesPerSegment;
unsigned int uiTargetThreadOffset;
if(uiTargetSegmentIdx < uiNSegmentsWithMoreNodes)
{ //Consider just the segments that should have more elements
uiTargetThreadOffset = uiCurrSumPlusThread % uiNNodesPerSegment;
}
else
{ //Consider segments that should have less elements too
unsigned int uiAfterSegsWithMoreOffset = uiCurrSumPlusThread - uiSegsWithMoreNodesTimesNodesPerSeg;
uiTargetSegmentIdx = uiNSegmentsWithMoreNodes + uiAfterSegsWithMoreOffset/uiNNodesPerSegmentMinus1;
uiTargetThreadOffset = uiAfterSegsWithMoreOffset % uiNNodesPerSegmentMinus1;
}
while(uiThreadOffset < uiFrontCount)
{
if(uiTargetSegmentIdx < uiNSegmentsWithMoreNodes)
{
if(uiTargetThreadOffset >= uiNNodesPerSegment)
{
uiCurrSumPlusThread = uiCurrPrefixSum + uiThreadOffset;
uiTargetSegmentIdx = uiCurrSumPlusThread / uiNNodesPerSegment;
if(uiTargetSegmentIdx < uiNSegmentsWithMoreNodes)
{ //Consider just the segments that should have more elements
uiTargetThreadOffset = uiCurrSumPlusThread % uiNNodesPerSegment;
}
else
{ //Consider segments that should have less elements too
unsigned int uiAfterSegsWithMoreOffset = uiCurrSumPlusThread - uiSegsWithMoreNodesTimesNodesPerSeg;
uiTargetSegmentIdx = uiNSegmentsWithMoreNodes + uiAfterSegsWithMoreOffset/uiNNodesPerSegmentMinus1;
uiTargetThreadOffset = uiAfterSegsWithMoreOffset % uiNNodesPerSegmentMinus1;
}
}
}
else
{
if(uiTargetThreadOffset >= uiNNodesPerSegmentMinus1)
{
uiCurrSumPlusThread = uiCurrPrefixSum + uiThreadOffset;
unsigned int uiAfterSegsWithMoreOffset = uiCurrSumPlusThread - uiSegsWithMoreNodesTimesNodesPerSeg;
uiTargetSegmentIdx = uiNSegmentsWithMoreNodes + uiAfterSegsWithMoreOffset/uiNNodesPerSegmentMinus1;
uiTargetThreadOffset = uiAfterSegsWithMoreOffset % uiNNodesPerSegmentMinus1;
}
}
bvttNode = inputFront[uiBlockOffset*FRONT_GLOBAL_SIZE + uiThreadOffset];
outputFront[uiTargetSegmentIdx*FRONT_GLOBAL_SIZE + uiTargetThreadOffset] = bvttNode;
uiThreadOffset += uiNThreads;
uiTargetThreadOffset += uiNThreads;
}
}
I need help to figure out what is wrong with the front update part of CUDA_BVHCollide. I know the problem is in that part of the code, since when I uncomment the part that just copy entire front to work queues and comment the optimized front update it does well. Assuming that this code is problematic, I think there are more probability that the error is in updateFront kernel or CUDA_balanceFront and associated kernels. I’m almost a week tracking this problem, but no avail. Any help will be very appreciated.