Device vs. Emulation Kernel problem

I’ve got two kernels to order the nodes in a binary tree.

[codebox]__global__ void UpPass(Nodes p_Nodes, uint p_Elements, uint* p_NodeSizes, uint p_CurLevel)
{
	// calculate parent node's index
	uint tIndex = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	if(tIndex >= p_Elements) return;

	// check if the node is at the current tree level
	if(g_TreeLevel[tIndex] != p_CurLevel) return;

	// get child indices
	int2 childs = p_Nodes.m_Childs[tIndex];
	uint curNodeSize = 1;

	// add leaf sizes
	if(childs.x >= 0)
		curNodeSize += p_NodeSizes[childs.x];
	if(childs.y >= 0)
		curNodeSize += p_NodeSizes[childs.y];

	// write back current node size
	p_NodeSizes[tIndex] = curNodeSize;
}

__global__ void DownPass(DevicePhotons p_Photons, Nodes p_Nodes, uint p_Elements,
						 uint* p_NodeSizes, uint* p_NodeAddresses, KdTreeNodes p_KdTree, uint p_CurLevel)
{
	// calculate parent node's index
	uint tIndex = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	if(tIndex >= p_Elements) return;

	// check if the node is at the current tree level
	if(g_TreeLevel[tIndex] != p_CurLevel) return;

	// get current node address
	uint curAddress = p_NodeAddresses[tIndex];

	// get child indices
	int2 childs = p_Nodes.m_Childs[tIndex];

	// add leaf sizes
	int2 newChilds;
	if(childs.x >= 0)
	{
		newChilds.x = curAddress + 1;
		p_NodeAddresses[childs.x] = newChilds.x;
	}
	if(childs.y >= 0)
	{
		newChilds.y = curAddress + 1 + p_NodeSizes[childs.x];
		p_NodeAddresses[childs.y] = newChilds.y;
	}

	// write node data to final node address curAddress
	...
}[/codebox]

g_TreeLevel is a global array that contains each node's tree depth. The first kernel (UpPass) starts at the deepest nodes and works up to the root. It works nicely and returns the correct data in p_NodeSizes. The second kernel (DownPass) starts at the root and works down to the deepest nodes (both kernels are executed in loops over the tree levels).

However, the second kernel fails: when I look at the p_NodeAddresses array, a few entries hold invalid indices (random numbers). The code above works nicely in emulation mode, but on the device it produces these faulty results. I also checked the child indices in both emulation and device mode, and they are fine. Do I need any additional synchronization in my kernels?
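One way to narrow this down is a host-side reference for both passes that can be compared entry by entry against the arrays copied back from the device. The sketch below is plain C++, not the original code, and rests on several assumptions: ChildPair, hostNodeSizes and hostNodeAddresses are hypothetical names, the root is taken to be at level 0 and at address 0, and a negative child index means "no child". Unlike DownPass, which reads p_NodeSizes[childs.x] in the childs.y branch without first checking childs.x >= 0, this version only reads the left child's size when a left child exists. If a node can have a right child but no left child, that unguarded read would be out of bounds, which might explain results that differ between emulation and device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for one entry of p_Nodes.m_Childs
// (an int2 on the device). A negative index means "no child".
struct ChildPair { int x; int y; };

// Host version of the UpPass loop: subtree sizes, deepest level first.
std::vector<unsigned> hostNodeSizes(const std::vector<ChildPair>& childs,
                                    const std::vector<int>& level,
                                    int maxLevel)
{
    assert(childs.size() == level.size());
    std::vector<unsigned> sizes(childs.size(), 0);
    for (int cur = maxLevel; cur >= 0; --cur) {
        for (std::size_t i = 0; i < childs.size(); ++i) {
            if (level[i] != cur) continue;
            unsigned s = 1;  // the node itself
            if (childs[i].x >= 0) s += sizes[childs[i].x];
            if (childs[i].y >= 0) s += sizes[childs[i].y];
            sizes[i] = s;
        }
    }
    return sizes;
}

// Host version of the DownPass loop: pre-order addresses, root level first.
// Assumption: the root sits at level 0 and is stored at address 0.
std::vector<unsigned> hostNodeAddresses(const std::vector<ChildPair>& childs,
                                        const std::vector<int>& level,
                                        const std::vector<unsigned>& sizes,
                                        int maxLevel, std::size_t root)
{
    assert(childs.size() == level.size());
    assert(childs.size() == sizes.size());
    std::vector<unsigned> addr(childs.size(), 0);
    addr[root] = 0;
    for (int cur = 0; cur <= maxLevel; ++cur) {
        for (std::size_t i = 0; i < childs.size(); ++i) {
            if (level[i] != cur) continue;
            const int left = childs[i].x;
            const int right = childs[i].y;
            // Only read the left subtree size if a left child exists.
            const unsigned leftSize = (left >= 0) ? sizes[left] : 0;
            if (left >= 0)  addr[left]  = addr[i] + 1;
            if (right >= 0) addr[right] = addr[i] + 1 + leftSize;
        }
    }
    return addr;
}
```

To use it, copy p_NodeSizes and p_NodeAddresses back to the host with cudaMemcpy after the two loops and compare them against the vectors returned here. The index and level of the first mismatch should show whether the fault comes from a missing left child or from somewhere else.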