I am doing some table lookups with CUDA.
In my case the table is too big, so I cannot use the cache well.
The performance is very low (~0.17 Gbps).
I wonder how much bandwidth to expect without the cache. :)
Do you have any ideas?
If you know how to optimize memory access, you’ll get around 80 to 130 GB/s for most devices.
I suppose it is 170 MB/s, or at least 40 million IOPS, and that is not so bad depending on the number of threads you launched (I suppose your tests are single-threaded).
If you have to do global memory look-ups, you’d better try to run the maximum number of parallel threads (until you hit the limit where your threads spill into “local” memory, which is global memory too).
One trick that I use is to have one warp per SM do prefetching from global memory while the other threads do the processing, to try to hide global-memory latency. You may limit the number of concurrent processing warps, say from 6 down to 1, losing on high-latency operations (such as MUL) in order to increase the number of available registers and “shared” memory.
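To make the idea concrete, here is a minimal sketch of that prefetching pattern, with all names, tile sizes, and the “processing” itself made up for illustration: warp 0 of each block copies the next tile of a global array into a shared-memory double buffer while the remaining warps work on the current tile, with __syncthreads() as the hand-off.

```cuda
// Sketch only: one warp per block prefetches the next tile of `data`
// into shared memory while the remaining warps process the current tile.
// TILE and the doubling "work" are placeholders, not a real workload.
#define TILE 256

__global__ void prefetch_pipeline(const int *data, int *out, int ntiles)
{
    __shared__ int buf[2][TILE];              // double buffer
    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;

    // Warp 0 fills the first buffer before processing starts.
    if (warp == 0)
        for (int i = lane; i < TILE; i += 32)
            buf[0][i] = data[i];
    __syncthreads();

    for (int t = 0; t < ntiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        if (warp == 0) {
            // Prefetch warp: load tile t+1 while the others compute on tile t.
            if (t + 1 < ntiles)
                for (int i = lane; i < TILE; i += 32)
                    buf[nxt][i] = data[(t + 1) * TILE + i];
        } else {
            // Processing warps: dummy work on the current tile.
            for (int i = threadIdx.x - 32; i < TILE; i += blockDim.x - 32)
                out[t * TILE + i] = buf[cur][i] * 2;
        }
        __syncthreads();       // hand-off: next tile is ready, current is done
    }
}
```

The point is simply that the prefetch warp’s loads for tile t+1 overlap in time with the other warps’ computation on tile t, so the global-memory latency is hidden behind useful work.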
Do you know any good resources for that? :) Please let me know.
I don’t know what prefetching means. :) Do you have some example code?
My program reads multiple strings and matches them against hundreds of DFAs.
The DFA tables are too large, so it is hard to get coalesced memory reads or to use shared memory.
One thread reads one string and matches it against one table.
I think reading the string is not the heavy operation,
because about 64 threads read the same memory, and I heard that if the compute capability is 2.0 or higher,
the warp will do one read and broadcast it. (Am I right?)
I think I have to do something about that DFA table, but it is too big for shared memory.
And I guess that during the match, the read addresses change randomly,
so I think texture memory is not a good fit either. (Texture memory is global memory with a cache, am I right?)
So… I am stuck :)
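For reference, if I understand the setup, each thread walks one string through one transition table, something like the sketch below. All names, the MAX_LEN stride, and the row-major table layout are assumptions; the point is that the state-dependent index into `table` is what makes the reads effectively random and uncacheable.

```cuda
// Sketch of the access pattern described above: one thread per string,
// each walking its own DFA transition table. The layout assumed here is
// table[state * ALPHABET + symbol]; strings are padded to MAX_LEN.
#define ALPHABET 256
#define MAX_LEN  1024

__global__ void dfa_match(const unsigned char *strings, const int *lengths,
                          const int *table, const int *accept,
                          int nstrings, int *matched)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nstrings) return;

    const unsigned char *s = strings + tid * MAX_LEN;
    int state = 0;                                   // start state
    for (int i = 0; i < lengths[tid]; ++i)
        state = table[state * ALPHABET + s[i]];      // scattered, uncoalesced read
    matched[tid] = accept[state];                    // 1 if final state accepts
}
```

Each iteration’s load address depends on the previous state, so neighboring threads in a warp diverge to unrelated table rows almost immediately, which matches the poor bandwidth being observed.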
You can take a look at the SDK’s matrix multiplication example and the whitepaper that comes with it.