Table lookups: data set too large to fit in constant/shared mem

Hi all,

I have a kernel which spends approximately 33% of its time performing a binary search operation to do a table lookup.

The search is performed on a sorted array of energy values (the spacing between successive energies is not constant) to find the array index nearest to my energy of interest.

However, things that complicate its successful implementation in CUDA:

  1. this “energy mesh” is very large: ~131 KB in a sample problem I am toying with, and a real problem would be substantially larger.

  2. accesses will not be coalesced, even amongst threads in the same warp, because the energy of interest for each thread is essentially random.

Does anyone have any experience/lessons learned/suggestions regarding attacking this type of problem? I have pasted the texture search and the regular array search functions below. The texture search is still quirky and seems to take even longer.

I’ll be honest, I don’t quite get what Section D.3 of the programming guide is telling me.

[codebox]__device__ unsigned int textureSearch(unsigned int first, unsigned int last,
                                      float key, unsigned int loc)
{
    // Binary search over the energy mesh held in the 1D texture.
    unsigned int return_val = 0;
    bool found = false;               // index 0 is a valid hit, so don't use 0 as the sentinel
    while ((first <= last) && !found)
    {
        unsigned int mid = first + (last - first) / 2;  // overflow-safe midpoint
        float mesh_val = tex1Dfetch(big_Emesh_t, mid);  // fetch once, compare twice
        if (key > mesh_val)
            first = mid + 1;
        else if (key < mesh_val)
        {
            if (mid == 0) break;      // unsigned: mid - 1 would wrap to UINT_MAX
            last = mid - 1;
        }
        else
        {
            return_val = mid;
            found = true;
        }
    }
    if (!found)
        return_val = first;           // insertion point (equals last + 1 on a clean miss)
    return_val -= big_Emesh_offsets_d[loc];  // rebase into this table's local indices
    return return_val;
}

__device__ unsigned int binarySearch(float* sortedArray, unsigned int first,
                                     unsigned int last, float key)
{
    while (first <= last)
    {
        unsigned int mid = first + (last - first) / 2;  // overflow-safe midpoint
        if (key > sortedArray[mid])
            first = mid + 1;
        else if (key < sortedArray[mid])
        {
            if (mid == 0) break;      // unsigned: mid - 1 would wrap to UINT_MAX
            last = mid - 1;
        }
        else
            return mid;               // exact match
    }
    return first;                     // insertion point (equals last + 1 on a clean miss)
}

[/codebox]
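For completeness, the texture and the offset array referenced above are declared and bound roughly like this; this is just a sketch with placeholder sizes (N_TABLES is hypothetical) and no error checking:

[codebox]#include <cuda_runtime.h>

#define N_TABLES 64   // hypothetical number of lookup tables

texture<float, 1, cudaReadModeElementType> big_Emesh_t;   // full energy mesh
__device__ unsigned int big_Emesh_offsets_d[N_TABLES];    // per-table start offsets

void bindEnergyMesh(const float* h_Emesh, size_t n_entries,
                    const unsigned int* h_offsets)
{
    float* d_Emesh;
    cudaMalloc((void**)&d_Emesh, n_entries * sizeof(float));
    cudaMemcpy(d_Emesh, h_Emesh, n_entries * sizeof(float),
               cudaMemcpyHostToDevice);
    // tex1Dfetch() in textureSearch() reads through this binding.
    cudaBindTexture(0, big_Emesh_t, d_Emesh, n_entries * sizeof(float));
    cudaMemcpyToSymbol(big_Emesh_offsets_d, h_offsets,
                       N_TABLES * sizeof(unsigned int));
}[/codebox]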

Thanks all,

Adam

You use cached texture or global memory reads for your array search, and yes, the accesses won’t be coalesced after the first one, since every thread starts at the same midpoint and then the paths diverge.
(And I don’t see any way to coalesce these lookups.)

You have already done a first optimisation by using the texture cache, but this cache isn’t magic.

For myself, I would consider a two-way strategy (sketched in code below):

  • first, run the early rounds of the lookup from shared memory, which holds a coarse sample of the array (the n/2 value, the n/2 ± n/4 values, …): with 16 KB of shared memory and one 32-bit value per entry you get up to 4096 entries, which leaves a 32-byte chunk of your real array behind each entry (4096 × 32 bytes = 128 KB, about the size of your mesh)
  • then do one linear read of that 32-byte chunk (8 × 32-bit values) from global memory into registers
  • and select the value from those registers using “stupid” non-indexed code :-)

You will end up with one global memory access of 256 bits that won’t be coalesced, but you pay its latency only once.
I always prefer to read one big chunk sequentially, if possible, instead of doing multiple pseudo-random reads.
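A rough sketch of what I mean (untested; twoLevelSearch, lookupKernel, N_COARSE and STRIDE are names I just made up, and I assume the mesh length is an exact multiple of 8):

[codebox]#define STRIDE   8     // 8 floats = one 32-byte chunk of the full mesh
#define N_COARSE 4096   // ~16 KB of shared memory (shave a few entries if it doesn't fit)

// sampledEnergies points to a __shared__ array filled so that
// sampledEnergies[i] == big_Emesh[i * STRIDE].
__device__ unsigned int twoLevelSearch(const float* big_Emesh,
                                       const float* sampledEnergies,
                                       float key)
{
    // 1) Coarse binary search in shared memory: largest lo such that
    //    sampledEnergies[lo] <= key (upper-biased midpoint, no underflow).
    unsigned int lo = 0, hi = N_COARSE - 1;
    while (lo < hi)
    {
        unsigned int mid = lo + (hi - lo + 1) / 2;
        if (key >= sampledEnergies[mid])
            lo = mid;
        else
            hi = mid - 1;
    }

    // 2) One linear 32-byte read of the bracketed chunk into registers.
    unsigned int base = lo * STRIDE;
    float v[STRIDE];
    #pragma unroll
    for (int i = 0; i < STRIDE; ++i)
        v[i] = big_Emesh[base + i];

    // 3) "Stupid" non-indexed selection: count chunk entries <= key.
    //    Fully unrolled, so v[] stays in registers. Returns the largest
    //    index with big_Emesh[idx] <= key (rounding to the truly nearest
    //    mesh point is left out for brevity).
    unsigned int idx = base;
    #pragma unroll
    for (int i = 1; i < STRIDE; ++i)
        if (key >= v[i]) ++idx;
    return idx;
}

__global__ void lookupKernel(const float* big_Emesh, const float* keys,
                             unsigned int* out, unsigned int n_keys)
{
    // Each block fills the coarse table once (strided reads, but paid only once).
    __shared__ float sampledEnergies[N_COARSE];
    for (unsigned int i = threadIdx.x; i < N_COARSE; i += blockDim.x)
        sampledEnergies[i] = big_Emesh[i * STRIDE];
    __syncthreads();

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n_keys)
        out[tid] = twoLevelSearch(big_Emesh, sampledEnergies, keys[tid]);
}[/codebox]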

Ok, I see what you are saying: partition the large array into a smaller sampled array at the finest resolution that will fit in shared mem. Then just determine which two indices of this smaller array my value falls between, and use those as the ‘first’ and ‘last’ of the binary search. Sounds like it should help.
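In other words, something like this (untested; reusing my binarySearch from above plus the sampledEnergies / STRIDE / N_COARSE names from your sketch):

[codebox]// Coarse pass brackets the key to one STRIDE-wide chunk of the full array,
// then the existing binarySearch runs only inside that chunk.
__device__ unsigned int bracketedSearch(float* sortedArray,
                                        const float* sampledEnergies,  // __shared__
                                        float key)
{
    unsigned int lo = 0, hi = N_COARSE - 1;
    while (lo < hi)   // largest lo with sampledEnergies[lo] <= key
    {
        unsigned int mid = lo + (hi - lo + 1) / 2;
        if (key >= sampledEnergies[mid])
            lo = mid;
        else
            hi = mid - 1;
    }
    return binarySearch(sortedArray, lo * STRIDE, lo * STRIDE + STRIDE - 1, key);
}[/codebox]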

I will try that later tonight. Thanks.

Let me know if it improves performance; I wrote some stupid things on my blog about CUDA technologies and ways to improve performance ;-)