Table lookups: data set too large to fit in constant/shared mem

Hi all,

I have a kernel which spends approximately 33% of its time performing a binary search operation to do a table lookup.

The search is performed on a sorted array of energy values (the spacing between successive energies is not constant) to find the array index nearest to my energy of interest.

However, things that complicate its successful implementation in CUDA:

  1. this “energy mesh” is very large: ~131 KB in a sample problem I am toying with, and a real problem would be substantially larger.

  2. accesses will not be coalesced, even amongst threads in the same warp, because the energy of interest for each thread is essentially random.

Does anyone have any experience/lessons learned/suggestions regarding attacking this type of problem? I have pasted the texture search and the regular array search functions below. The texture search is still quirky and seems to take even longer.

I’ll be honest, I don’t quite get what Section D.3 of the programming guide is telling me.

[codebox]__device__ unsigned int textureSearch(unsigned int first, unsigned int last,
                                      float key, unsigned int loc)
{
    // Binary search over the energy mesh held in the 1D texture.
    unsigned int return_val = 0;
    bool found = false;               // index 0 is a valid hit, so don't use 0 as the sentinel
    while ((first <= last) && !found)
    {
        unsigned int mid = first + (last - first) / 2;  // overflow-safe midpoint
        float mesh_val = tex1Dfetch(big_Emesh_t, mid);  // fetch once, compare twice
        if (key > mesh_val)
            first = mid + 1;
        else if (key < mesh_val)
        {
            if (mid == 0) break;      // unsigned: mid - 1 would wrap to UINT_MAX
            last = mid - 1;
        }
        else
        {
            return_val = mid;
            found = true;
        }
    }
    if (!found)
        return_val = first;           // insertion point (equals last + 1 on a clean miss)
    return_val -= big_Emesh_offsets_d[loc];  // rebase into this table's local indices
    return return_val;
}

__device__ unsigned int binarySearch(float* sortedArray, unsigned int first,
                                     unsigned int last, float key)
{
    while (first <= last)
    {
        unsigned int mid = first + (last - first) / 2;  // overflow-safe midpoint
        if (key > sortedArray[mid])
            first = mid + 1;
        else if (key < sortedArray[mid])
        {
            if (mid == 0) break;      // unsigned: mid - 1 would wrap to UINT_MAX
            last = mid - 1;
        }
        else
            return mid;               // exact match
    }
    return first;                     // insertion point (equals last + 1 on a clean miss)
}

[/codebox]
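For completeness, the texture and the offset array referenced above are declared and bound roughly like this; this is just a sketch with placeholder sizes (N_TABLES is hypothetical) and no error checking:

[codebox]#include <cuda_runtime.h>

#define N_TABLES 64   // hypothetical number of lookup tables

texture<float, 1, cudaReadModeElementType> big_Emesh_t;   // full energy mesh
__device__ unsigned int big_Emesh_offsets_d[N_TABLES];    // per-table start offsets

void bindEnergyMesh(const float* h_Emesh, size_t n_entries,
                    const unsigned int* h_offsets)
{
    float* d_Emesh;
    cudaMalloc((void**)&d_Emesh, n_entries * sizeof(float));
    cudaMemcpy(d_Emesh, h_Emesh, n_entries * sizeof(float),
               cudaMemcpyHostToDevice);
    // tex1Dfetch() in textureSearch() reads through this binding.
    cudaBindTexture(0, big_Emesh_t, d_Emesh, n_entries * sizeof(float));
    cudaMemcpyToSymbol(big_Emesh_offsets_d, h_offsets,
                       N_TABLES * sizeof(unsigned int));
}[/codebox]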

Thanks all,

Adam

You use cached texture or global memory reads for your array search, and yes, the accesses won’t be coalesced after the first one, since every thread starts at the same midpoint and then the paths diverge.
(And I don’t see any way to coalesce these lookups.)

You have already done a first optimisation by using the texture cache, but this cache isn’t magic.

For myself, I would consider a two-way strategy (sketched in code below):

  • first, run the early rounds of the lookup from shared memory, which holds a coarse sample of the array (the n/2 value, the n/2 ± n/4 values, …): with 16 KB of shared memory and one 32-bit value per entry you get up to 4096 entries, which leaves a 32-byte chunk of your real array behind each entry (4096 × 32 bytes = 128 KB, about the size of your mesh)
  • then do one linear read of that 32-byte chunk (8 × 32-bit values) from global memory into registers
  • and select the value from those registers using “stupid” non-indexed code :-)

You will end up with one global memory access of 256 bits that won’t be coalesced, but you pay its latency only once.
I always prefer to read one big chunk sequentially, if possible, instead of doing multiple pseudo-random reads.
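A rough sketch of what I mean (untested; twoLevelSearch, lookupKernel, N_COARSE and STRIDE are names I just made up, and I assume the mesh length is an exact multiple of 8):

[codebox]#define STRIDE   8     // 8 floats = one 32-byte chunk of the full mesh
#define N_COARSE 4096   // ~16 KB of shared memory (shave a few entries if it doesn't fit)

// sampledEnergies points to a __shared__ array filled so that
// sampledEnergies[i] == big_Emesh[i * STRIDE].
__device__ unsigned int twoLevelSearch(const float* big_Emesh,
                                       const float* sampledEnergies,
                                       float key)
{
    // 1) Coarse binary search in shared memory: largest lo such that
    //    sampledEnergies[lo] <= key (upper-biased midpoint, no underflow).
    unsigned int lo = 0, hi = N_COARSE - 1;
    while (lo < hi)
    {
        unsigned int mid = lo + (hi - lo + 1) / 2;
        if (key >= sampledEnergies[mid])
            lo = mid;
        else
            hi = mid - 1;
    }

    // 2) One linear 32-byte read of the bracketed chunk into registers.
    unsigned int base = lo * STRIDE;
    float v[STRIDE];
    #pragma unroll
    for (int i = 0; i < STRIDE; ++i)
        v[i] = big_Emesh[base + i];

    // 3) "Stupid" non-indexed selection: count chunk entries <= key.
    //    Fully unrolled, so v[] stays in registers. Returns the largest
    //    index with big_Emesh[idx] <= key (rounding to the truly nearest
    //    mesh point is left out for brevity).
    unsigned int idx = base;
    #pragma unroll
    for (int i = 1; i < STRIDE; ++i)
        if (key >= v[i]) ++idx;
    return idx;
}

__global__ void lookupKernel(const float* big_Emesh, const float* keys,
                             unsigned int* out, unsigned int n_keys)
{
    // Each block fills the coarse table once (strided reads, but paid only once).
    __shared__ float sampledEnergies[N_COARSE];
    for (unsigned int i = threadIdx.x; i < N_COARSE; i += blockDim.x)
        sampledEnergies[i] = big_Emesh[i * STRIDE];
    __syncthreads();

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n_keys)
        out[tid] = twoLevelSearch(big_Emesh, sampledEnergies, keys[tid]);
}[/codebox]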

Ok, I see what you are saying: partition the large array into a smaller sampled array at the finest resolution that will fit in shared mem. Then just determine which two indices of this smaller array my value falls between, and use those as the ‘first’ and ‘last’ of the binary search. Sounds like it should help.
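In other words, something like this (untested; reusing my binarySearch from above plus the sampledEnergies / STRIDE / N_COARSE names from your sketch):

[codebox]// Coarse pass brackets the key to one STRIDE-wide chunk of the full array,
// then the existing binarySearch runs only inside that chunk.
__device__ unsigned int bracketedSearch(float* sortedArray,
                                        const float* sampledEnergies,  // __shared__
                                        float key)
{
    unsigned int lo = 0, hi = N_COARSE - 1;
    while (lo < hi)   // largest lo with sampledEnergies[lo] <= key
    {
        unsigned int mid = lo + (hi - lo + 1) / 2;
        if (key >= sampledEnergies[mid])
            lo = mid;
        else
            hi = mid - 1;
    }
    return binarySearch(sortedArray, lo * STRIDE, lo * STRIDE + STRIDE - 1, key);
}[/codebox]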

I will try that later tonight. Thanks.

Let me know if it improves performance; I wrote some stupid things on my blog about CUDA technologies and ways to improve performance ;-)