I am doing some table lookups with CUDA.
In my case the table is too big, so I cannot use the cache well.
The performance is very low (~0.17 Gbps).
I wonder how much bandwidth to expect without the cache. :)
Do you have any ideas?
If you know how to optimize memory access, you’ll get around 80 to 130 GB/s for most devices.
I suppose it is 170 MB/s, or at least 40 million IOPS, and that is not so bad depending on the number of threads you launched (I suppose your tests are single-threaded).
If you have to do global memory look-ups, you’d better try to run the maximum number of parallel threads (until you hit the limit where your threads spill into “local” memory, which is global memory too).
One trick that I use is to have one warp per SM do prefetching from global memory while the other threads do the processing, to try to hide global-memory latency. You may limit the number of concurrent processing warps, say from 6 down to 1, losing on high-latency operations (such as MUL) in order to increase the number of available registers and “shared” memory.
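To make the idea concrete, here is a minimal sketch of that prefetching pattern, with all names, tile sizes, and the “processing” itself made up for illustration: warp 0 of each block copies the next tile of a global array into a shared-memory double buffer while the remaining warps work on the current tile, with __syncthreads() as the hand-off.

```cuda
// Sketch only: one warp per block prefetches the next tile of `data`
// into shared memory while the remaining warps process the current tile.
// TILE and the doubling "work" are placeholders, not a real workload.
#define TILE 256

__global__ void prefetch_pipeline(const int *data, int *out, int ntiles)
{
    __shared__ int buf[2][TILE];              // double buffer
    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;

    // Warp 0 fills the first buffer before processing starts.
    if (warp == 0)
        for (int i = lane; i < TILE; i += 32)
            buf[0][i] = data[i];
    __syncthreads();

    for (int t = 0; t < ntiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        if (warp == 0) {
            // Prefetch warp: load tile t+1 while the others compute on tile t.
            if (t + 1 < ntiles)
                for (int i = lane; i < TILE; i += 32)
                    buf[nxt][i] = data[(t + 1) * TILE + i];
        } else {
            // Processing warps: dummy work on the current tile.
            for (int i = threadIdx.x - 32; i < TILE; i += blockDim.x - 32)
                out[t * TILE + i] = buf[cur][i] * 2;
        }
        __syncthreads();       // hand-off: next tile is ready, current is done
    }
}
```

The point is simply that the prefetch warp’s loads for tile t+1 overlap in time with the other warps’ computation on tile t, so the global-memory latency is hidden behind useful work.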
Do you know any good resources for that? :) Please let me know.
I don’t know what prefetching means. :) Do you have some example code?
My program reads multiple strings and matches them against hundreds of DFAs.
The DFA tables are too large, so it is hard to get coalesced memory reads or to use shared memory.
One thread reads one string and matches it against one table.
I think reading the string is not the heavy operation,
because about 64 threads read the same memory, and I heard that if the compute capability is 2.0 or higher,
the warp will do one read and broadcast it. (Am I right?)
I think I have to do something about that DFA table, but it is too big for shared memory.
And I guess that during the match, the read addresses change randomly,
so I think texture memory is not a good fit either. (Texture memory is global memory with a cache, am I right?)
So… I am stuck :)
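For reference, if I understand the setup, each thread walks one string through one transition table, something like the sketch below. All names, the MAX_LEN stride, and the row-major table layout are assumptions; the point is that the state-dependent index into `table` is what makes the reads effectively random and uncacheable.

```cuda
// Sketch of the access pattern described above: one thread per string,
// each walking its own DFA transition table. The layout assumed here is
// table[state * ALPHABET + symbol]; strings are padded to MAX_LEN.
#define ALPHABET 256
#define MAX_LEN  1024

__global__ void dfa_match(const unsigned char *strings, const int *lengths,
                          const int *table, const int *accept,
                          int nstrings, int *matched)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nstrings) return;

    const unsigned char *s = strings + tid * MAX_LEN;
    int state = 0;                                   // start state
    for (int i = 0; i < lengths[tid]; ++i)
        state = table[state * ALPHABET + s[i]];      // scattered, uncoalesced read
    matched[tid] = accept[state];                    // 1 if final state accepts
}
```

Each iteration’s load address depends on the previous state, so neighboring threads in a warp diverge to unrelated table rows almost immediately, which matches the poor bandwidth being observed.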
You can take a look at the SDK’s matrix multiplication example and the whitepaper that comes with it.