Tracking of a Bounding Volume Test Tree (BVTT) front with CUDA - Continuous Collision Detection

=====================================================================EDIT=======================================================================

Already fixed it. It was just a wrong comparison in the updateFront kernel. The code below has also been edited to reflect the fix.

================================================================================================================================================

First, some background. I'm trying to track the front of a BVTT using CUDA. My system has a 3D model object with a Bounding Volume Hierarchy (BVH), a tree. For the first frame, a full traversal (from the root node) is performed on it. The traversal works through work queues of pairs (i, j), where i and j are indices of BVH nodes; a CUDA kernel evaluates the collision for each pair while traversing the tree. While the traversal runs, the front of the BVTT is built by appending a pair (i, j) to a front array whenever the traversal advances no further from it: i and j can be internal BVH nodes whose overlap test evaluated to false, two leaf nodes, or a leaf node combined with an internal node. All subsequent frames use the front as the starting point for the traversal, exploiting temporal coherence. Each of those frames needs a front update, which controls how the front feeds the work queues for the next traversal. More details on front tracking are at http://gamma.cs.unc.edu/PCD/spm09.pdf (short paper) and http://gamma.cs.unc.edu/PCD/pccd.pdf (full paper). Details about the work queues and the traversal can be found in http://gamma.cs.unc.edu/GPUCOL/gProximity.pdf .
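
For readers unfamiliar with the technique, here is a minimal CPU-side sketch of the idea (illustrative names and a hypothetical BVH interface, not my actual code; the real traversal is GPU-based and subdivides the larger of the two nodes):

    #include <vector>

    struct BVH {                            // assumed interface, for illustration only
        bool overlap(int i, int j) const;   // do the two bounding volumes intersect?
        bool isLeaf(int n) const;
        int  leftChild(int n) const;        // right child is leftChild(n) + 1
        int  triangle(int n) const;
    };

    struct IntPair { int a, b; };

    void traverse(const BVH& bvh, int i, int j,
                  std::vector<IntPair>& front, std::vector<IntPair>& triPairs)
    {
        if (!bvh.overlap(i, j)) { front.push_back({i, j}); return; }    // front node
        if (bvh.isLeaf(i) && bvh.isLeaf(j)) {                           // front node
            front.push_back({i, j});
            triPairs.push_back({bvh.triangle(i), bvh.triangle(j)});
            return;
        }
        if (bvh.isLeaf(i)) {                // only j can be subdivided
            traverse(bvh, i, bvh.leftChild(j), front, triPairs);
            traverse(bvh, i, bvh.leftChild(j) + 1, front, triPairs);
        } else {                            // subdivide i (the real code picks the larger node)
            traverse(bvh, bvh.leftChild(i), j, front, triPairs);
            traverse(bvh, bvh.leftChild(i) + 1, j, front, triPairs);
        }
    }

Frame 0 calls traverse(bvh, root, root, front, pairs); frame n + 1 restarts the traversal only from the saved front nodes instead of the root pair.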

Now to my system itself. In its previous stable version, the front update was done by simply copying the entire front into the work queues and running the traversal kernel, which evaluates the work queues and adds nodes to the front. That approach wastes resources, because not every slot of the front array holds a useful node. My current effort is a more optimized front update: a kernel that runs before the traversal, evaluates only the useful nodes, feeds the work queues with them, and even feeds an output triangle-pair array directly when it finds BVH leaf nodes in the front array. I've already written this optimized code, but it also introduced bugs that are driving me mad!
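
Summarized, the decision the optimized update makes per front node looks like this (a hedged sketch with illustrative names; the adjacent-triangle filtering is omitted, and the real logic is in the updateFront kernel further down):

    struct Decision {
        bool keepInFront;   // (i, j) is still a front node for the next frame
        bool emitTriPair;   // write the triangle pair straight to the output array
        bool pushChildren;  // expand the pair into the work queue for traversal
    };

    Decision classify(bool leaf0, bool leaf1, bool bvsOverlap)
    {
        if (leaf0 && leaf1) return { true, bvsOverlap, false }; // leaf-leaf stays in the front
        if (bvsOverlap)     return { false, false, true };      // re-expand this branch
        return { true, false, false };                          // still a valid front node
    }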

Now to the code. First we have the CPU function CUDA_BVHCollide, which manages the collision query and launches the front update and traversal kernels. Both kernels, updateFront and traverseTree, run inside a loop and are followed by code that balances the buffers: since the buffers are segmented per block, a segment can fill up, and balancing evenly redistributes the elements across segments to avoid overflows. CUDA_balanceFront is the CPU function that manages the balancing for all buffers by calling the balanceFront kernel.
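
One layout detail that matters below (my summary, hedged): every buffer is one flat array in which block b owns a fixed-capacity segment, so element k of block b lives at b*capacity + k:

    // Sketch of the segmented addressing used by all the buffers below;
    // capacity stands in for QUEUE_SIZE_PER_TASK_GLOBAL or FRONT_GLOBAL_SIZE.
    __host__ __device__ inline unsigned int
    segmentSlot(unsigned int block, unsigned int k, unsigned int capacity)
    {
        return block * capacity + k;
    }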

ColInfoGroup* CUDA_BVHCollide(ModelInstance* model1)

    {

	OBBNode*		obbTree1 = (OBBNode*)model1->obbTree;

	

	SceneVertexList	h_vertices = model1->verlistFinal;

	GPUVertex*		d_vertices1 = (GPUVertex*)model1->vertexPointer;

	GPUVertex*		d_vertices2 = (GPUVertex*)model1->vertexPointerFinal;

	GPUVertex*		d_deltaPos = (GPUVertex*)model1->deltaPosPointer;

	GPUEdge*		d_edges = (GPUEdge*)model1->edgePointer;

	GPUTriEdge*		d_triEdges = (GPUTriEdge*)model1->triEdgesPointer;

	uint3*			d_triIndices1 = (uint3*)model1->triIdxPointer;

	unsigned int*	d_collisionPairIndex = NULL;	//Index of the pair inside the collision queue

	unsigned int*	d_collisionSync = NULL;

	unsigned int*	d_puiIdleFrontCount = NULL;

	unsigned int*	d_nWorkQueueElements = NULL;

	int2*			d_collisionPairs = NULL;

	int2*			d_workQueues = NULL;			//All the work queues in one big array. The two indices

	int2*			d_workQueues2 = NULL;			//of each element identify the BV pair

	unsigned int*	d_workQueueCounts = NULL;		//Each index has the number of elements of a work queue (per block)

	int*			d_balanceSignal = NULL;

	

	// allocate collision list (try to be conservative)

	unsigned int	collisionPairCapacity = COLLISION_PAIR_CAPACITY;

	GPUMALLOC((void**)&d_collisionPairs, sizeof(int2) * collisionPairCapacity);

	GPUMALLOC((void **)&d_collisionPairIndex, sizeof(int));

	GPUMALLOC((void **)&d_nWorkQueueElements, sizeof(int));

	GPUMALLOC((void **)&d_collisionSync, sizeof(int));

	GPUMALLOC(&d_puiIdleFrontCount , sizeof(int));

	

	// allocate work queues

	GPUMALLOC((void **)&d_workQueues, sizeof(int2)*QUEUE_NTASKS*QUEUE_SIZE_PER_TASK_GLOBAL);

	GPUMALLOC((void **)&d_workQueues2, sizeof(int2)*QUEUE_NTASKS*QUEUE_SIZE_PER_TASK_GLOBAL);

	GPUMALLOC((void **)&d_workQueueCounts, sizeof(int)*QUEUE_NTASKS);

	GPUMEMSET(d_workQueueCounts, 0, sizeof(int) * QUEUE_NTASKS);

	GPUMEMSET(d_collisionPairIndex, 0, sizeof(int));

	GPUMALLOC((void**)&d_balanceSignal, sizeof(int));

	int2*			d_aFront = (int2*) model1->frontPointer;

	unsigned int*	d_auiFrontCounts = (unsigned int*) model1->frontCountsPointer;

	int2*			d_aOutputFront;

	GPUMALLOC(&d_aOutputFront , sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE);

	bool bFrontUpdateNeeded = (d_aFront != NULL);	//a saved front exists only after the first frame

	

	if(!bFrontUpdateNeeded)

	{	//Full traversal needed. Allocate and set buffers properly

		unsigned int uiFrontCountsSize = sizeof(unsigned int)*QUEUE_NTASKS;

		unsigned int uiFrontSize = sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE;

		GPUMALLOC(&d_aFront , uiFrontSize);

		GPUMALLOC(&d_auiFrontCounts , uiFrontCountsSize);

		GPUMEMSET(d_auiFrontCounts , 0 , uiFrontCountsSize);

		// init first work element:

		GPUMEMSET(d_workQueues, 0, sizeof(int2));

		unsigned int firstCount = 1;

		TOGPU(d_workQueueCounts, &firstCount, sizeof(unsigned int));

	}

	else

	{	//Update front

		/*//Uncomment this block to update the front by simply copying the entire front to the work queues:

		GPUTOGPU(d_workQueues , d_aFront , sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE);

		GPUTOGPU(d_workQueueCounts , d_auiFrontCounts , sizeof(unsigned int)*QUEUE_NTASKS);

		GPUMEMSET(d_auiFrontCounts , 0 , sizeof(unsigned int)*QUEUE_NTASKS);*/

		

		//Otherwise keep this block: the optimized front update (inlined from CUDA_updateFront)

		unsigned int *d_auiOutputFrontIdx;

		unsigned int uiFrontCountsSize = sizeof(unsigned int)*QUEUE_NTASKS;

		GPUMALLOC(&d_auiOutputFrontIdx , uiFrontCountsSize);

		GPUMEMSET(d_auiOutputFrontIdx , 0 , uiFrontCountsSize);

		int2 *d_aOutputFront2;

		GPUMALLOC(&d_aOutputFront2 , sizeof(int2)*FRONT_GLOBAL_SIZE*QUEUE_NTASKS);

		unsigned int *d_puiIdleInputFront;

		GPUMALLOC(&d_puiIdleInputFront , sizeof(unsigned int));

		//CUDA_updateFront<OBBNode , OBB>(&d_aFront , &d_aOutputFront , &d_aOutputFront2 , d_auiFrontCounts , d_auiOutputFrontIdx ,

			//&d_workQueues , &d_workQueues2 , d_workQueueCounts , d_collisionPairs , d_collisionPairIndex , obbTree1 , d_collisionSync ,

			//d_puiIdleInputFront , d_puiIdleFrontCount , d_triIndices1);

		

		//starting CUDA_updateFront

		

		//Debug

		int2 *h_aOutputFront = (int2*)malloc(sizeof(int2)*QUEUE_NTASKS*FRONT_GLOBAL_SIZE);

		unsigned int h_auiOutputFrontIdx[QUEUE_NTASKS];

		//

		while(true)

		{

			GPUMEMSET(d_collisionSync, 0, sizeof(int));

			GPUTOGPU(d_puiIdleInputFront , d_collisionSync , sizeof(unsigned int));	//zero the idle counters by copying

			GPUTOGPU(d_puiIdleFrontCount , d_collisionSync , sizeof(unsigned int));	//the freshly zeroed sync flag

			updateFront<OBBNode , OBB , QUEUE_SIZE_PER_TASK_GLOBAL , FRONT_GLOBAL_SIZE><<<QUEUE_NTASKS, FRONT_UPDATE_THREADS>>>

				(d_aFront , d_aOutputFront , d_auiFrontCounts , d_auiOutputFrontIdx , d_workQueues , d_workQueueCounts ,

				d_collisionPairs , d_collisionPairIndex , obbTree1 , d_collisionSync , d_puiIdleInputFront ,d_puiIdleFrontCount , 

				d_triIndices1);

			unsigned int h_puiIdleInputFrontCount;

			FROMGPU(&h_puiIdleInputFrontCount , d_puiIdleInputFront , sizeof(unsigned int));

			if(h_puiIdleInputFrontCount == QUEUE_NTASKS)

			{

				//All nodes from input front segments were processed.

				break;

			}

			unsigned int h_puiIdleWorkQueuesCount;

			FROMGPU(&h_puiIdleWorkQueuesCount , d_collisionSync , sizeof(unsigned int));

			if(h_puiIdleWorkQueuesCount > QUEUE_IDLETASKS_FOR_ABORT ||

				(h_puiIdleInputFrontCount > QUEUE_IDLETASKS_FOR_ABORT && h_puiIdleWorkQueuesCount != 0))

			{

				cout << "update front: work queues balancing..." << endl;

				#ifdef PROFILE_DEBUG

					startTimer(&timer);

				#endif

			

				CUDA_balanceFront(d_workQueues , d_workQueues2 , d_workQueueCounts);

				int2* pTempWorkQueues = d_workQueues2;

				d_workQueues2 = d_workQueues;

				d_workQueues = pTempWorkQueues;

				#ifdef PROFILE_DEBUG

					endTimer("balanceWorkList" , &timer);

				#endif

			}

			unsigned int h_puiIdleOutputFrontCount;

			FROMGPU(&h_puiIdleOutputFrontCount , d_puiIdleFrontCount , sizeof(unsigned int));

			if(h_puiIdleOutputFrontCount > QUEUE_IDLETASKS_FOR_ABORT ||

				(h_puiIdleInputFrontCount > QUEUE_IDLETASKS_FOR_ABORT && h_puiIdleOutputFrontCount != 0))

			{

				cout << "update front: output front balancing..." << endl;

				#ifdef PROFILE_DEBUG

					startTimer(&timer);

				#endif

				//Debug

				dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aOutputFront , d_aOutputFront2 , d_auiOutputFrontIdx);

				//

				CUDA_balanceFront(d_aOutputFront , d_aOutputFront2 , d_auiOutputFrontIdx);

				int2* pTempFront = d_aOutputFront2;

				d_aOutputFront2 = d_aOutputFront;

				d_aOutputFront = pTempFront;

			

				//Debug

				dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aOutputFront , d_aOutputFront2 , d_auiOutputFrontIdx);

				//

				#ifdef PROFILE_DEBUG

					endTimer("balanceFront" , &timer);

				#endif

			}

		}

		//Debug

		safeFree(h_aOutputFront);

		

		//end CUDA_updateFront

		GPUFREE(d_aOutputFront2);

		GPUFREE(d_auiFrontCounts);

		int2 *d_aTempFront = d_aOutputFront;

		d_aOutputFront = d_aFront;

		d_aFront = d_aTempFront;

		d_auiFrontCounts = d_auiOutputFrontIdx;

	}

	bool bActiveSplits = true;

	unsigned int nVertices = model1->nVerts;

	unsigned int nPairs = 0;

	bool bFrontBalance = false;

	#ifdef PROFILE_DEBUG

		TimerValue timer;

		timer.start();

	#endif

	while(bActiveSplits)

	{	

		GPUMEMSET(d_collisionSync, 0, sizeof(int));

		GPUTOGPU(d_puiIdleFrontCount , d_collisionSync , sizeof(int));

		#ifdef PROFILE_DEBUG

			unsigned int timer = 0;

			startTimer(&timer);

		#endif

		traverseTree<OBBNode, OBB, QUEUE_SIZE_PER_TASK_GLOBAL, QUEUE_SIZE_PER_TASK_INIT, TRAVERSAL_THREADS , FRONT_GLOBAL_SIZE>

			<<< QUEUE_NTASKS, TRAVERSAL_THREADS>>>

			(obbTree1, d_vertices1, d_triIndices1, d_workQueues, d_workQueueCounts, d_collisionSync,

			QUEUE_SIZE_PER_TASK_GLOBAL, d_collisionPairs, d_collisionPairIndex , d_aFront , d_auiFrontCounts , d_puiIdleFrontCount);

		

		cudaThreadSynchronize();

		//dummyKernel<int2 , unsigned int , int , int , int><<<1,32>>>(d_aFront , d_auiFrontCounts);

		#ifdef PROFILE_DEBUG

			endTimer("traverseTree" , &timer);

		#endif

		CUT_CHECK_ERROR("Traversal error\n");

		unsigned int h_collisionSync;

		FROMGPU(&h_collisionSync , d_collisionSync , sizeof(unsigned int));

		if(h_collisionSync > QUEUE_IDLETASKS_FOR_ABORT)

		{

			#ifdef PROFILE_DEBUG

				startTimer(&timer);

			#endif

			

			//balanceWorkList<BALANCE_THREADS, QUEUE_NTASKS, int2> <<< 1, BALANCE_THREADS>>>(d_workQueues, d_workQueues2, d_workQueueCounts, QUEUE_SIZE_PER_TASK_GLOBAL, d_nWorkQueueElements, d_balanceSignal);

			//cudaThreadSynchronize();

			unsigned int *d_puiTotalNodes = CUDA_balanceFront(d_workQueues , d_workQueues2 , d_workQueueCounts);

			int2* pTempWorkQueues = d_workQueues2;

			d_workQueues2 = d_workQueues;

			d_workQueues = pTempWorkQueues;

			#ifdef PROFILE_DEBUG

				endTimer("balanceWorkList" , &timer);

			#endif

			unsigned int h_uiTotalNodes;

			FROMGPU(&h_uiTotalNodes , d_puiTotalNodes , sizeof(unsigned int));

			if(h_uiTotalNodes == 0)

			{

				bActiveSplits = false;

				break;

			}

		}

		unsigned int h_uiIdleFrontCount;

		FROMGPU(&h_uiIdleFrontCount , d_puiIdleFrontCount , sizeof(unsigned int));

		if(h_uiIdleFrontCount > QUEUE_IDLETASKS_FOR_ABORT)

		{

			#ifdef PROFILE_DEBUG

				startTimer(&timer);

			#endif

			

			dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aFront , d_aOutputFront , d_auiFrontCounts);

			

			CUDA_balanceFront(d_aFront , d_aOutputFront , d_auiFrontCounts);

			dummyKernel<int2 , int2 , unsigned int , int , int><<<1,32>>>(d_aFront , d_aOutputFront , d_auiFrontCounts);

			int2* pTempFront = d_aOutputFront;

			d_aOutputFront = d_aFront;

			d_aFront = pTempFront;

			#ifdef PROFILE_DEBUG

				endTimer("balanceFront" , &timer);

			#endif

		}

		#ifdef TRAVERSAL_DEBUG

			showTraversalDebugInfo(d_workQueues2, d_workQueueCounts, d_nWorkQueueElements, d_collisionSync, d_collisionPairs, d_collisionPairIndex);

			//showBalanceDebugInfo(d_workQueues, d_workQueues2, d_workQueueCounts, QUEUE_SIZE_PER_TASK_GLOBAL, d_nWorkQueueElements, d_balanceSignal);

		#endif

	}

	model1->frontPointer = d_aFront;

	model1->frontCountsPointer = d_auiFrontCounts;

	#ifdef PROFILE_DEBUG

		timer.end("Traversal + Balancing");

		g_fTotalTraversalTime += timer.getElapsedMs();

		cout << "average traversal time = " << g_fTotalTraversalTime/(++g_uiNFrames) << endl;

	#endif

	FROMGPU(&nPairs, d_collisionPairIndex, sizeof(int));

	//Free memory for triangle pair collision query

	GPUFREE(d_workQueueCounts);

	GPUFREE(d_workQueues);

	GPUFREE(d_workQueues2);

	GPUFREE(d_collisionPairIndex);

	GPUFREE(d_nWorkQueueElements);

	GPUFREE(d_collisionSync);

	GPUFREE(d_puiIdleFrontCount);

	GPUFREE(d_balanceSignal);

	GPUFREE(d_aOutputFront);

    //... the code after this point doesn't matter for this problem

    }

Now for the kernels associated with CUDA_BVHCollide. First, updateFront and its auxiliary functions.

template<typename TreeNode , typename BV , unsigned int uiGlobalWorkQueueCapacity , unsigned int uiGlobalFrontCapacity>

    void __global__ updateFront(int2 *inputFront , int2 *outputFront , unsigned int *auiInputFrontIdx , unsigned int *auiOutputFrontIdx ,

	int2* work_queues , unsigned int* auiWork_queue_counts , int2 *auiCollisionPairs , unsigned int *uiCollisionPairsIdx ,

	const TreeNode* tree , unsigned int *puiIdleCount , unsigned int *puiIdleInputFront , unsigned int *puiIdleFrontCount , 

	uint3 *aTriIndices)

    {

	const unsigned int uiBlockOffset = blockIdx.x;

	const unsigned int uiThreadOffset = threadIdx.x;

	__shared__ unsigned int uiInputFrontIdx;

	__shared__ unsigned int uiOutputFrontIdx;

	__shared__ unsigned int uiWorkQueueIdx;

	__shared__ unsigned int uiNThreads;

	__shared__ unsigned int wantToAbort;

	__shared__ unsigned int uiWantToAbortFront;

	if(uiThreadOffset == 0)

	{

		uiInputFrontIdx = auiInputFrontIdx[uiBlockOffset];

		uiOutputFrontIdx = auiOutputFrontIdx[uiBlockOffset];

		uiWorkQueueIdx = auiWork_queue_counts[uiBlockOffset];

		uiNThreads = blockDim.x;

	}

	__syncthreads();

	if(uiInputFrontIdx == 0)

	{

		callAbort<QUEUE_NTASKS>(puiIdleInputFront , uiThreadOffset);

		return;

	}

	__syncthreads();

	while(uiInputFrontIdx > 0)

	{

		int2 bvttNode;

		int nActive = min(uiNThreads, uiInputFrontIdx);

		bvttNode.x = -1;

		

		if(uiThreadOffset < uiInputFrontIdx)

		{

			bvttNode = inputFront[uiBlockOffset*FRONT_GLOBAL_SIZE + uiInputFrontIdx - nActive + uiThreadOffset];

		}

		__syncthreads();

		if(uiThreadOffset == 0)

		{

			uiInputFrontIdx -= nActive;

		}

		__syncthreads();

		if(bvttNode.x != -1)

		{

			TreeNode node0 = tree[bvttNode.x];

			TreeNode node1 = tree[bvttNode.y];

			if(node0.isLeaf() && node1.isLeaf())

			{

				int iTri0 = node0.getTriID();

				int iTri1 = node1.getTriID();

				uint3 idx0 = aTriIndices[iTri0];

				uint3 idx1 = aTriIndices[iTri1];

				//Add only non-adjacent triangle pairs

				if(	idx0.x != idx1.x && idx0.x != idx1.y && idx0.x != idx1.z &&

					idx0.y != idx1.x && idx0.y != idx1.y && idx0.y != idx1.z &&

					idx0.z != idx1.x && idx0.z != idx1.y && idx0.z != idx1.z)

				{

					if(intersect<BV>(node0.bbox, node1.bbox))

					{

						unsigned int uiColIdx = atomicAdd(uiCollisionPairsIdx , 1);

						int2 pair;

						pair.x = iTri0;

						pair.y = iTri1;

						auiCollisionPairs[uiColIdx] = pair;

					}

					addToFront(bvttNode , outputFront , uiOutputFrontIdx , uiBlockOffset , uiGlobalFrontCapacity);

				}

			}

			else if(node1.isLeaf() || (!node0.isLeaf() && (node0.bbox.getSize() > node1.bbox.getSize())))

			{ // node0 is internal and larger (or node1 is a leaf): subdivide node0.

				if(intersect<BV>(node0.bbox, node1.bbox))

				{

					int2 workItem;

					workItem.x = bvttNode.y;

					workItem.y = bvttNode.x + node0.getLeftChild();

					addToWorkQueue(uiWorkQueueIdx , work_queues , workItem , uiGlobalWorkQueueCapacity , uiBlockOffset);

				}

				else

				{

					addToFront(bvttNode , outputFront , uiOutputFrontIdx , uiBlockOffset , uiGlobalFrontCapacity);

				}

			}

			else

			{ // node1 is internal and at least as large (or node0 is a leaf): subdivide node1.

				if(intersect<BV>(node0.bbox, node1.bbox))

				{

					int2 workItem;

					workItem.x = bvttNode.x;

					workItem.y = bvttNode.y + node1.getLeftChild();

					addToWorkQueue(uiWorkQueueIdx , work_queues , workItem , uiGlobalWorkQueueCapacity , uiBlockOffset);

				}

				else

				{

					addToFront(bvttNode , outputFront , uiOutputFrontIdx , uiBlockOffset , uiGlobalFrontCapacity);

				}

			}

		}

		__syncthreads();

		if(uiInputFrontIdx == 0)

		{

			callAbort<QUEUE_NTASKS>(puiIdleInputFront , uiThreadOffset);

			break;

		}

		if((uiWorkQueueIdx >= QUEUE_SIZE_PER_TASK_GLOBAL - uiNThreads * 2))

		{

			callAbort<QUEUE_NTASKS>(puiIdleCount, uiThreadOffset);

			break;

		}

		if(uiOutputFrontIdx > uiGlobalFrontCapacity - uiNThreads * 2)

		{

			callAbort<QUEUE_NTASKS>(puiIdleFrontCount , uiThreadOffset);

			break;

		}

		if(uiThreadOffset == 0)

		{

			wantToAbort = *puiIdleCount;

			uiWantToAbortFront = *puiIdleFrontCount;

		}

		

		__syncthreads();

		if(wantToAbort > QUEUE_IDLETASKS_FOR_ABORT)

		{

			callAbort<QUEUE_NTASKS>(puiIdleCount, uiThreadOffset);

			break;

		}

		if(uiWantToAbortFront > QUEUE_IDLETASKS_FOR_ABORT)

		{

			callAbort<QUEUE_NTASKS>(puiIdleFrontCount, uiThreadOffset);

			break;

		}

	}

	if(uiThreadOffset == 0)

	{

		auiOutputFrontIdx[uiBlockOffset] = uiOutputFrontIdx;

		auiInputFrontIdx[uiBlockOffset] = uiInputFrontIdx;

		auiWork_queue_counts[uiBlockOffset] = uiWorkQueueIdx;

	}

    }

template <int nTotalProcessors>

    static __device__ __inline__ void callAbort(unsigned int *workQueueCounter, const int threadID)

    {

        if(threadID == 0)

		atomicInc(workQueueCounter, nTotalProcessors);

    }

__device__ __inline__ void addToFront(const int2 &work_item , int2 *aGlobalFront , unsigned int &uiGlobalFrontIdx ,

	const unsigned int &uiBlockOffset , const unsigned int &iGlobalFrontCapacity)

    {

	unsigned int uiGlobalIdx = atomicAdd(&uiGlobalFrontIdx , 1);

	aGlobalFront[uiBlockOffset * iGlobalFrontCapacity + uiGlobalIdx] = work_item;

    }

void __device__ __inline__ addToWorkQueue(unsigned int &uiWorkQueueIdx , int2* work_queues , int2 &workItem ,

	const unsigned int uiGlobalWorkQueueCapacity , const unsigned int &uiBlockOffset)

    {

	unsigned int uiWorkIdx = atomicAdd(&uiWorkQueueIdx , 2);	//reserve two consecutive slots, one per child

	unsigned int uiWorkItemOffset = uiBlockOffset*uiGlobalWorkQueueCapacity + uiWorkIdx;

	work_queues[uiWorkItemOffset] = workItem;

	++workItem.y;	//the right child is laid out right after the left child

	work_queues[uiWorkItemOffset + 1] = workItem;

    }
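
To make the loop structure above easier to follow: a block that exhausts its input, or whose output segment is nearly full, increments a global counter once through callAbort; after the kernel returns, the host reads that counter and rebalances when too many blocks have given up. Condensed, the handshake looks like this (hypothetical buffer names, same pattern as the loops in CUDA_BVHCollide):

    unsigned int h_uiIdle;
    FROMGPU(&h_uiIdle, d_puiIdleCounter, sizeof(unsigned int)); //blocks that hit callAbort
    if (h_uiIdle > QUEUE_IDLETASKS_FOR_ABORT)
    {
        CUDA_balanceFront(d_aBuffer, d_aBuffer2, d_auiCounts);  //level the segments
        int2* pTemp = d_aBuffer2;                               //the balanced copy becomes
        d_aBuffer2 = d_aBuffer;                                 //the next input
        d_aBuffer = pTemp;
    }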

Now we have the CPU function CUDA_balanceFront and its associated kernels.

//Do front balancing

    inline

    unsigned int* CUDA_balanceFront(int2 *d_aFront , int2 *d_aOutputFront , unsigned int *d_auiFrontCounts)

    {

	//Prefix sum the front counters

	CUDPPHandle theCudpp;

	cudppCreate(&theCudpp);

	CUDPPConfiguration config;

	config.op = CUDPP_ADD;

	config.datatype = CUDPP_UINT;

	config.algorithm = CUDPP_SCAN;

	config.options = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;

	CUDPPHandle scanPlan = 0;

	CUDPP_SAFE_CALL(cudppPlan(theCudpp, &scanPlan, config, QUEUE_NTASKS, 1, 0));

	

	unsigned int uiFrontPrefixSumSize = sizeof(unsigned int)*QUEUE_NTASKS;

	unsigned int *d_auiFrontPrefixSum;

	GPUMALLOC(&d_auiFrontPrefixSum , uiFrontPrefixSumSize);

	

	CUDPP_SAFE_CALL(cudppScan(scanPlan , d_auiFrontPrefixSum , d_auiFrontCounts , QUEUE_NTASKS));

	unsigned int *d_puiTotalSum;

	GPUMALLOC(&d_puiTotalSum , sizeof(unsigned int));

	GPUTOGPU(d_puiTotalSum , d_auiFrontPrefixSum + QUEUE_NTASKS - 1, sizeof(unsigned int));

		

	unsigned int *d_puiLastCount;

	GPUMALLOC(&d_puiLastCount , sizeof(unsigned int));

	GPUTOGPU(d_puiLastCount , d_auiFrontCounts + QUEUE_NTASKS - 1, sizeof(unsigned int));

	unsigned int *d_auiTempFrontCounts;

	GPUMALLOC(&d_auiTempFrontCounts , uiFrontPrefixSumSize);

	GPUTOGPU(d_auiTempFrontCounts , d_auiFrontCounts , uiFrontPrefixSumSize);

	unsigned int *d_puiNSegmentsWithMoreNodes;

	GPUMALLOC(&d_puiNSegmentsWithMoreNodes , sizeof(unsigned int));

	unsigned int *d_puiNodesPerSegment;

	GPUMALLOC(&d_puiNodesPerSegment , sizeof(unsigned int));

	recomputeCounters<<<1 , QUEUE_NTASKS>>>(d_auiFrontCounts , d_puiTotalSum , d_puiLastCount , d_puiNSegmentsWithMoreNodes ,

		d_puiNodesPerSegment);

	

	CUT_CHECK_ERROR("recomputeCounters error.\n");

	balanceFront<<<QUEUE_NTASKS , BALANCE_THREADS>>>(d_aFront , d_aOutputFront , d_auiTempFrontCounts , d_auiFrontPrefixSum , 

		d_puiNodesPerSegment , d_puiNSegmentsWithMoreNodes);

	CUT_CHECK_ERROR("balanceFront error.\n");

	CUDPP_SAFE_CALL(cudppDestroyPlan(scanPlan));

	cudppDestroy(theCudpp);

	GPUFREE(d_puiNSegmentsWithMoreNodes);

	GPUFREE(d_puiNodesPerSegment);

	GPUFREE(d_auiTempFrontCounts);

	GPUFREE(d_auiFrontPrefixSum);

	return d_puiTotalSum;

    }
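
A note on the total: cudppScan is configured as an exclusive scan, so the last prefix entry excludes the last segment's count; that is why CUDA_balanceFront copies both values and recomputeCounters adds them. A tiny host-side check of the arithmetic (illustrative, with QUEUE_NTASKS = 4):

    #include <cassert>

    int main()
    {
        unsigned int counts[4] = {4, 0, 7, 1};
        unsigned int prefix[4], sum = 0;
        for (int i = 0; i < 4; ++i) { prefix[i] = sum; sum += counts[i]; } // exclusive scan
        // recomputeCounters computes the total the same way, on the GPU:
        unsigned int total = prefix[3] + counts[3];
        assert(total == 12);    // 11 + 1
        return 0;
    }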

void __global__ recomputeCounters(unsigned int *auiFrontCounts , unsigned int *uiTotalSum , const unsigned int *uiLastCount , 

	unsigned int *puiNSegmentsWithMoreNodes , unsigned int *puiNodesPerSegment)

    {

	__shared__ unsigned int uiTotalNodes;

	__shared__ unsigned int uiNodesPerSegment;

	__shared__ unsigned int uiRemainder;

	unsigned int uiThreadOffset = threadIdx.x;

	if(uiThreadOffset == 0)

	{

		uiTotalNodes = *uiTotalSum + *uiLastCount;	//the exclusive scan drops the last count, so add it back

		uiNodesPerSegment = uiTotalNodes/QUEUE_NTASKS;

		uiRemainder = uiTotalNodes % QUEUE_NTASKS;

	}

	__syncthreads();

	if(uiThreadOffset < uiRemainder)

	{

		auiFrontCounts[uiThreadOffset] = uiNodesPerSegment + 1;

	}

	else

	{

		auiFrontCounts[uiThreadOffset] = uiNodesPerSegment;

	}

	if(uiThreadOffset == 0)

	{

		*puiNSegmentsWithMoreNodes = uiRemainder;

		*puiNodesPerSegment = uiNodesPerSegment + 1;	//size of the segments that get one extra node (unused when the remainder is 0)

		*uiTotalSum = uiTotalNodes;

	}

    }

void __global__ balanceFront(const int2 *inputFront , int2 *outputFront , const unsigned int *auiFrontCounts ,

	const unsigned int *auiPrefixSum , const unsigned int *puiNodesPerSegment , const unsigned int *puiNSegmentsWithMoreNodes)

    {

	unsigned int uiBlockOffset = blockIdx.x;

	unsigned int uiThreadOffset = threadIdx.x;

	unsigned int uiNThreads = blockDim.x;

	__shared__ unsigned int uiFrontCount;

	__shared__ unsigned int uiCurrPrefixSum;

	__shared__ unsigned int uiNSegmentsWithMoreNodes;

	__shared__ unsigned int uiSegsWithMoreNodesTimesNodesPerSeg;

	__shared__ unsigned int uiNNodesPerSegment;

	__shared__ unsigned int uiNNodesPerSegmentMinus1;

	

	if(uiThreadOffset == 0)

	{

		uiFrontCount = auiFrontCounts[uiBlockOffset];

		uiNSegmentsWithMoreNodes = *puiNSegmentsWithMoreNodes;

		uiNNodesPerSegment = *puiNodesPerSegment;

		uiNNodesPerSegmentMinus1 = uiNNodesPerSegment - 1;

		uiSegsWithMoreNodesTimesNodesPerSeg = uiNSegmentsWithMoreNodes*uiNNodesPerSegment;

		uiCurrPrefixSum = auiPrefixSum[uiBlockOffset];

		/*

		if(uiBlockOffset < uiNSegmentsWithMoreNodes)

		{ //Consider just the segments that should have more elements

			uiCurrPrefixSum = uiBlockOffset*uiNNodesPerSegment;

		}

		else

		{ //Consider segments that should have less elements too

			uiCurrPrefixSum = uiSegsWithMoreNodesTimesNodesPerSeg +

				(uiBlockOffset - uiNSegmentsWithMoreNodes)*(uiNNodesPerSegmentMinus1);

		}*/

	}

	__syncthreads();

	

	int2 bvttNode;

	unsigned int uiCurrSumPlusThread = uiCurrPrefixSum + uiThreadOffset;

	unsigned int uiTargetSegmentIdx = uiCurrSumPlusThread / uiNNodesPerSegment;

	unsigned int uiTargetThreadOffset;

	if(uiTargetSegmentIdx < uiNSegmentsWithMoreNodes)

	{ //Consider just the segments that should have more elements

		uiTargetThreadOffset = uiCurrSumPlusThread % uiNNodesPerSegment;

	}

	else

	{ //Consider segments that should have less elements too

		unsigned int uiAfterSegsWithMoreOffset = uiCurrSumPlusThread - uiSegsWithMoreNodesTimesNodesPerSeg;

		uiTargetSegmentIdx = uiNSegmentsWithMoreNodes + uiAfterSegsWithMoreOffset/uiNNodesPerSegmentMinus1;

		uiTargetThreadOffset = uiAfterSegsWithMoreOffset % uiNNodesPerSegmentMinus1;

	}

	while(uiThreadOffset < uiFrontCount)

	{

		if(uiTargetSegmentIdx < uiNSegmentsWithMoreNodes)

		{

			if(uiTargetThreadOffset >= uiNNodesPerSegment)

			{

				uiCurrSumPlusThread = uiCurrPrefixSum + uiThreadOffset;

				uiTargetSegmentIdx = uiCurrSumPlusThread / uiNNodesPerSegment;

				if(uiTargetSegmentIdx < uiNSegmentsWithMoreNodes)

				{ //Consider just the segments that should have more elements

					uiTargetThreadOffset = uiCurrSumPlusThread % uiNNodesPerSegment;

				}

				else

				{ //Consider segments that should have less elements too

					unsigned int uiAfterSegsWithMoreOffset = uiCurrSumPlusThread - uiSegsWithMoreNodesTimesNodesPerSeg;

					uiTargetSegmentIdx = uiNSegmentsWithMoreNodes + uiAfterSegsWithMoreOffset/uiNNodesPerSegmentMinus1;

					uiTargetThreadOffset = uiAfterSegsWithMoreOffset % uiNNodesPerSegmentMinus1;

				}

			}

		}

		else

		{

			if(uiTargetThreadOffset >= uiNNodesPerSegmentMinus1)

			{

				uiCurrSumPlusThread = uiCurrPrefixSum + uiThreadOffset;

				unsigned int uiAfterSegsWithMoreOffset = uiCurrSumPlusThread - uiSegsWithMoreNodesTimesNodesPerSeg;

				uiTargetSegmentIdx = uiNSegmentsWithMoreNodes + uiAfterSegsWithMoreOffset/uiNNodesPerSegmentMinus1;

				uiTargetThreadOffset = uiAfterSegsWithMoreOffset % uiNNodesPerSegmentMinus1;

			}

		}

		

		bvttNode = inputFront[uiBlockOffset*FRONT_GLOBAL_SIZE + uiThreadOffset];

		outputFront[uiTargetSegmentIdx*FRONT_GLOBAL_SIZE + uiTargetThreadOffset] = bvttNode;

		uiThreadOffset += uiNThreads;

		uiTargetThreadOffset += uiNThreads;

	}

    }
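
To check the target-index math in balanceFront, here is a host-side replay with small numbers (illustrative): with 10 nodes over 4 segments, recomputeCounters produces nodesPerSegment = 2, remainder = 2, publishes k = 3 as *puiNodesPerSegment, and the new counts are {3, 3, 2, 2}. The snippet below reproduces the kernel's segment/slot computation for each global node index g = prefixSum[block] + threadOffset:

    #include <cstdio>

    int main()
    {
        const unsigned int R = 2;   // *puiNSegmentsWithMoreNodes: segments with one extra node
        const unsigned int k = 3;   // *puiNodesPerSegment: size of those larger segments
        for (unsigned int g = 0; g < 10; ++g)
        {
            unsigned int seg, slot;
            if (g / k < R) { seg = g / k; slot = g % k; }   // lands in a larger segment
            else
            {
                unsigned int rest = g - R * k;              // offset past the larger segments
                seg  = R + rest / (k - 1);
                slot = rest % (k - 1);
            }
            printf("node %2u -> segment %u, slot %u\n", g, seg, slot);
        }
        return 0;                   // fills segments 0..3 with 3, 3, 2, 2 nodes
    }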

I need help figuring out what is wrong with the front update part of CUDA_BVHCollide. I know the problem is in that part of the code, because when I uncomment the block that simply copies the entire front to the work queues and comment out the optimized front update, everything works. Assuming the optimized code is at fault, the error is most likely in the updateFront kernel or in CUDA_balanceFront and its associated kernels. I've been tracking this problem for almost a week, to no avail. Any help will be very appreciated.

Man that’s a huge code dump.

Maybe you could try reducing your example to the minimum code that still produces said bad behavior. Otherwise the chances are low that someone will jump in and try to understand and debug that piece of code.

Already solved.