help with low memory access throughput + algorithm GPU hierarchy traversal problem

tilo · July 13, 2011, 2:01pm

Hi.

I have implemented a GPU construction and traversal on the GPU. My bottleneck in the application is in the traversal code. More specifically in the end of the traversal where I look up the data for the leaf nodes and perform some computation.

With some profiling the memory throughput for my bottleneck kernel is around 1 GB/s on my GTX 460.

I think the slowdown is due to large amounts of memory accesses and probably uncoalesced accesses.

This is the kernel code simplified (for core idea), for the bottleneck parts:

__global__ void pickSamples(CUDA_MaxHeap_Interleaved* heaps, float *queries, const uint nUniqueQueryIndices, const uint nQueries,

							float *samples, const uint nSamples, uint *uniqueQueryIndices,

							uint *sampleIndices, const bool queriesUnique, uint *cnt,

							uint *uniqueQueryIndices_starts, uint *uniqueQueryIndices_counts)

{

	unsigned int threadId = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

	if(threadId >= nUniqueQueryIndices) //N is the total amount of (query, nodeIdx) pairs

		return;

	uint uniqueQI = uniqueQueryIndices[threadId];

	//use uniqueQI to access of nQueries datastructure (interlieved data)

float dist;

	float* data = queries + uniqueQI;

	uint queryIndices_start = uniqueQueryIndices_starts[threadId];

	uint queryIndices_count = uniqueQueryIndices_counts[threadId];

	uint count;

	uint startIdx;

	uint c;

	uint sample_ind;

	for(uint j = 0; j < queryIndices_count; j++)

	{

		c = tex1Dfetch(texRef_nodeIndices, queryIndices_start+j);

		count = tex1Dfetch(texRef_uniqueCounts, c);

		startIdx = tex1Dfetch(texRef_uniqueStarts, c);

		c = count;

		for(uint i = 0; i < count; i++)

		{

			sample_ind = sampleIndices[startIdx+i];

//calculate distance results for a given query and sample

			dist = distance(data, samples + sample_ind, nQueries, nSamples, dim);

			//insert distance data to datastructure for corresponding uniqueQI

		} //end for

		cnt[uniqueQI] += c;

	} //end for

}

I have n queries and each of these traverses the tree hierarchy down to leaf nodes. Also each query can result in several leaf nodes being interesting (used later on in this kernel). So for a specific query index can traverse down to several leaf nodes.

So I have an uniqueQueryIndices array in order to access only the unique query samples. Supplied with that I have uniqueQueryIndices_starts, which tells where in the sorted nodeIndices array to start access and the count of it. NodeIndices stores the indices into a start and count array for samples that the leaf node bounds. So as an overall I have a nested for-loop.

The inner for-loop just calculates the distance of a query and a sample.

I used texture memory, but is does not show much improvement at all.

Does anyone know how I can improve on this?

Does anyone have experience with multi query tree traversal which results in multiple datas, and can share their experience?

Thanks

Topic		Replies	Views
Memory intensive CUDA benchmarks CUDA Programming and Performance	0	496	June 12, 2018
Fast texture reading fast memory reading would suffice! CUDA Programming and Performance	1	1522	April 16, 2008
simple question CUDA Programming and Performance	5	739	August 2, 2011
Finding performance bottlenecks CUDA Programming and Performance	0	4792	May 28, 2007
Unified memory CUDA Programming and Performance	2	755	November 11, 2019
Why is this a uint64_t subtraction and bit shift generating bottleneck? CUDA Programming and Performance	2	852	January 10, 2021
Slow sequential loops on CUDA CUDA Programming and Performance	3	674	April 18, 2023
[Help] Kernel Optimization Image subsampling CUDA Programming and Performance	2	4217	July 30, 2007
a not so simple question CUDA Programming and Performance	5	6066	November 2, 2007
Bad performance when using unified memory CUDA Programming and Performance	2	3417	April 21, 2019

help with low memory access throughput + algorithm GPU hierarchy traversal problem

Related topics