Global lookup tables in shared memory

Hi,

I’d like to store some (hard to compute and frequently accessed) lookup tables in shared memory. All the threads of the grid will need to access the same lookup table data, although for every access each thread might need to access a different element. (If a particular access always wanted the same lookup table element in all threads, I guess constant memory would be my best option.)

As far as I understand, the lifetime of shared memory is limited to the lifetime of the block… and each block will be allocated its private part of the available SM shared memory. So, if I want to keep two active, runnable blocks on each SM, they will each have to use less than half of the total shared memory available (e.g., with 16 kB of shared memory per SM, less than 8 kB each), although each half will store exactly the same information, right?

Is there any way to avoid this? I.e., any way to state that all threads will want the same data stored in shared memory, so that blocks allocated to the same SM can happily share the space?

/Lars

I don’t think there is any way of doing what you want to do, though I can see why you’d want to do it.

I think your only options are to either use constant memory or duplicate the table in shared memory.

[quote]
if I want to keep two active, runnable blocks on each SM, they will each have to use less than half of the total shared memory available,
although each half will store exactly the same information, right?
[/quote]

Shared memory is the working space of a block on one SM. It cannot be initialized from the host (constant memory can be).
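For example, a constant-memory table can be filled from the host before the kernel launch; a minimal sketch (c_table, h_table and the size 2048 are made-up names):

[code]
__constant__ float c_table[2048];   // hypothetical table size (8 kB of floats)

// Fill the constant-memory table from the host before launching the kernel.
void uploadTable(const float *h_table)
{
    cudaMemcpyToSymbol(c_table, h_table, 2048 * sizeof(float));
}

__global__ void useConstTable(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The constant cache broadcasts when all threads of a warp read the
    // same element; divergent indices are serialized.
    if (i < n)
        d_out[i] = c_table[i % 2048];
}
[/code]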

How would you get your lookup table into shared memory, via global memory?

Right, initialization could be a problem.

Why don’t you try using textures? They are cached at each SM, so if the entire grid uses the same lookup table, this could be a good option.
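Something along these lines, with the texture reference API (a rough sketch; texTable, d_table and the table size are made-up names):

[code]
// Texture reference API (current at the time of this thread; since deprecated).
texture<float, 1, cudaReadModeElementType> texTable;

// Bind the lookup table (already in global memory) to the texture reference
// so that kernel reads go through the texture cache.
void bindTable(const float *d_table, size_t bytes)
{
    cudaBindTexture(NULL, texTable, d_table, bytes);
}

__global__ void useTexTable(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = tex1Dfetch(texTable, i % 2048);  // 2048: hypothetical size
}
[/code]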

Yes, I’m just thinking about it at this stage, but I was planning to read the lookup tables into shared memory by coalesced transfers at the beginning of the kernel.
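Roughly like this (an untested sketch; TABLE_SIZE, d_table and kernelWithTable are placeholder names):

[code]
#define TABLE_SIZE 2048  // placeholder; must fit in one block's share of SMEM

__global__ void kernelWithTable(const float *d_table, float *d_out, int n)
{
    // Each block gets its OWN private copy of this array; two blocks
    // resident on the same SM store the same data twice.
    __shared__ float s_table[TABLE_SIZE];

    // Cooperative, coalesced load: consecutive threads read consecutive
    // elements, striding by the block size until the whole table is copied.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = d_table[i];
    __syncthreads();  // the table must be complete before anyone reads it

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        d_out[tid] = s_table[tid % TABLE_SIZE];  // per-thread random access
}
[/code]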

/L

Actually, I’m using textures to access the lookup tables right now. It works OK, but for a couple of reasons I would still like to store the most frequently accessed parts of the lookup tables in shared memory. (One advantage is that although the total lookup table data set is large, I know which portions of it will be most frequently accessed.)

First, my kernel is already accessing another large data set through textures, and I guess those accesses will compete for texture cache space with the (texture-based) lookup table accesses and cause unnecessary cache conflicts.

Second, my kernel does not use (and seems unlikely to benefit from) shared memory for anything else, so I have all this fast on-chip memory sitting unused unless I can store my lookup tables in it.

I guess I will still try to store the most frequently accessed parts of the lookup tables in shared memory, but if I want to keep two active blocks per SM, it means I will only be able to store half as much information, since the second block will hold a private shared memory copy of exactly the same data… unless there is some workaround to make blocks share common shared memory for the lifetime of a kernel?

If not, and unless some fundamental hardware limit prevents it, it would be a nice feature for future CUDA versions: the ability to define and initialize a part of shared memory as common to and shared between all blocks, with the lifetime of the kernel instead of the lifetime of the block. Would that be possible?

Still, in the meantime, any ideas for a workaround?

/Lars

[Moved to keep the other thread on-topic]

Still, 8 kB is 50% of your whole table. So it really comes down to: do the threads of one block access more than 50% of the lookup table? I would like to bring up an “old” saying on this forum: benchmark it.

It really is the only way to know what works best. What I can tell you is that, of all the memory types, constant memory gets you the highest throughput according to benchmarks made by MrAnderson42 (even higher than shared memory).

Also, the actual texture cache might be located at the Texture Processing Cluster level, which would mean you have 24 kB of cache available per 3 multiprocessors (GT200 architecture), so your whole lookup table would fit. So I would really advise you to benchmark.
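For what it’s worth, timing the variants is easy with CUDA events; a minimal sketch (myKernel is a placeholder for whichever variant you are testing):

[code]
#include <cstdio>

// Time one kernel launch with CUDA events (returns elapsed milliseconds).
float timeKernel()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // myKernel<<<grid, block>>>(...);  // placeholder: the variant under test
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
[/code]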