help with low memory access throughput + algorithm GPU hierarchy traversal problem


I have implemented a GPU construction and traversal on the GPU. My bottleneck in the application is in the traversal code. More specifically in the end of the traversal where I look up the data for the leaf nodes and perform some computation.

With some profiling the memory throughput for my bottleneck kernel is around 1 GB/s on my GTX 460.

I think the slowdown is due to large amounts of memory accesses and probably uncoalesced accesses.

This is the kernel code simplified (for core idea), for the bottleneck parts:

__global__ void pickSamples(CUDA_MaxHeap_Interleaved* heaps, float *queries, const uint nUniqueQueryIndices, const uint nQueries,

							float *samples, const uint nSamples, uint *uniqueQueryIndices,

							uint *sampleIndices, const bool queriesUnique, uint *cnt,

							uint *uniqueQueryIndices_starts, uint *uniqueQueryIndices_counts)


	unsigned int threadId = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

	if(threadId >= nUniqueQueryIndices) //N is the total amount of (query, nodeIdx) pairs


	uint uniqueQI = uniqueQueryIndices[threadId];

	//use uniqueQI to access of nQueries datastructure (interlieved data)

float dist;

	float* data = queries + uniqueQI;

	uint queryIndices_start = uniqueQueryIndices_starts[threadId];

	uint queryIndices_count = uniqueQueryIndices_counts[threadId];

	uint count;

	uint startIdx;

	uint c;

	uint sample_ind;

	for(uint j = 0; j < queryIndices_count; j++)


		c = tex1Dfetch(texRef_nodeIndices, queryIndices_start+j);

		count = tex1Dfetch(texRef_uniqueCounts, c);

		startIdx = tex1Dfetch(texRef_uniqueStarts, c);

		c = count;

		for(uint i = 0; i < count; i++)


			sample_ind = sampleIndices[startIdx+i];

//calculate distance results for a given query and sample

			dist = distance(data, samples + sample_ind, nQueries, nSamples, dim);

			//insert distance data to datastructure for corresponding uniqueQI

		} //end for

		cnt[uniqueQI] += c;

	} //end for


I have n queries and each of these traverses the tree hierarchy down to leaf nodes. Also each query can result in several leaf nodes being interesting (used later on in this kernel). So for a specific query index can traverse down to several leaf nodes.

So I have an uniqueQueryIndices array in order to access only the unique query samples. Supplied with that I have uniqueQueryIndices_starts, which tells where in the sorted nodeIndices array to start access and the count of it. NodeIndices stores the indices into a start and count array for samples that the leaf node bounds. So as an overall I have a nested for-loop.

The inner for-loop just calculates the distance of a query and a sample.

I used texture memory, but is does not show much improvement at all.

Does anyone know how I can improve on this?

Does anyone have experience with multi query tree traversal which results in multiple datas, and can share their experience?