Do dedicated shared memory and unified L1/Texture cache share the same bandwidth (Pascal)?

In Pascal, we have dedicated shared memory and a unified L1/texture cache. But are they physically the same structure with a logical division into two parts (and thus sharing the same 32 banks and bandwidth)? Or are they physically isolated, each with its own connection/bandwidth to other functional units?

They are separate entities.

I don’t think bandwidth numbers for these are published. They can be discovered experimentally, perhaps.

http://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache

In general, neither of these items has connections to functional units. Functional units are serviced by the register file. These items are generally part of the memory/cache path (the LD/ST path).

Oh, you’re right. But do they share the same path to registers?

I’m asking because I wrote a program that is bound by shared memory bandwidth, and I’m sure there are no memory bank conflicts. I’m considering leaving part of the data in global memory and accessing it with __ldg(), and keeping the rest in shared memory. If shared memory and the L1 cache do not share the bandwidth to the register file, I may be able to double the performance with this approach.
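A minimal sketch of what I have in mind (kernel name, sizes, and the addition are illustrative, not my actual program): one operand is staged in shared memory, the other is loaded through the read-only data cache with __ldg(), so the two loads would take different paths if the paths are indeed separate.

```cuda
// Hypothetical sketch: split the working set between shared memory and
// read-only (L1/texture-path) global loads via __ldg().
__global__ void split_path_kernel(const float* __restrict__ gmem_part,
                                  const float* smem_src,
                                  float* out, int n)
{
    __shared__ float smem_part[1024];

    // Stage the shared-memory half of the data once per block.
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        smem_part[i] = smem_src[blockIdx.x * 1024 + i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // One operand comes over the shared-memory path...
        float a = smem_part[threadIdx.x];
        // ...the other over the L1/texture path via the read-only cache.
        float b = __ldg(&gmem_part[tid]);
        out[tid] = a + b;
    }
}
```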

I heard that the L1 cache is not optimised for temporal locality. I guess that means the L1 cache has very few cache lines per set? Maybe it is direct-mapped? Even in that case, I think that by tuning the access pattern I’ll still be able to avoid cache conflicts and exploit temporal locality. I’ll need to figure out how many cache sets L1 contains, though.

As you said, maybe I have to test experimentally.

I don’t know specifically about Pascal, but in general loads through the L1 or texture caches go through a different path than loads from shared memory.

However, it is quite possible that the approach you are contemplating is not going to result in much of a noticeable performance increase, since the latency of global memory access is much higher, and bandwidth likely significantly lower compared to shared memory. Unless the data fits into the combined L1/texture cache (small, but high bandwidth), I would anticipate at most a minor performance boost.

Since details of the microarchitecture, including the operation of the caches, are not publicly documented, I would advocate an experimental approach instead of theorizing. In other words, just give your idea a try and see what happens. You might also want to ponder how to maximize the use of the register file in place of shared memory (e.g. by use of register blocking), as this is the highest-performance memory available on the GPU.
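To illustrate what I mean by register blocking, here is a hypothetical sketch (tile size of 4 and all names are illustrative): each thread keeps a small accumulator tile in registers, so each loaded value is reused several times from registers rather than being re-read from shared or global memory.

```cuda
// Hypothetical register-blocking sketch: each thread holds a 4-element
// accumulator tile in registers and reuses every loaded value 4 times.
__global__ void register_blocked(const float* __restrict__ in,
                                 float* out, int n)
{
    float acc[4] = {0.f, 0.f, 0.f, 0.f};  // lives in registers
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;

    for (int k = 0; k < 16; ++k) {
        // One load from memory...
        float v = __ldg(&in[(base / 4 + k) % n]);
        // ...reused 4 times from a register, with no further memory traffic.
        #pragma unroll
        for (int j = 0; j < 4; ++j)
            acc[j] += v * (j + 1);
    }

    #pragma unroll
    for (int j = 0; j < 4; ++j)
        if (base + j < n) out[base + j] = acc[j];
}
```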

I’ll make sure there is no frequent access to L2/global. In each SM, only a few (maybe ~8) consecutive 128-byte blocks of global memory will be accessed repeatedly; I guess in that case these small memory blocks should stay in the L1 cache. The threads will repeatedly access these cache lines.
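Something like this hypothetical sketch (sizes and names illustrative): the hot working set is confined to 8 consecutive 128-byte lines (256 floats), so after the first pass all repeated __ldg() reads should, with luck, hit in L1.

```cuda
// Hypothetical sketch: restrict repeated global reads to ~8 consecutive
// 128-byte cache lines (8 * 128 / sizeof(float) = 256 floats) so they
// should stay resident in the L1/texture cache after the first pass.
#define HOT_FLOATS 256

__global__ void hot_lines(const float* __restrict__ hot, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.f;

    // Every iteration touches only the same small block of global memory.
    for (int k = 0; k < 1000; ++k)
        acc += __ldg(&hot[(tid + k) % HOT_FLOATS]);

    if (tid < n) out[tid] = acc;
}
```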

On the Maxwell and Pascal architectures, shared memory and the texture cache are separate execution paths; however, they share register file bandwidth for instruction operands and write-back. I do not think you will find that you are register-file limited. Please note that the texture cache has in-order return, so if you mix in any misses, all hits after a miss will be delayed by that miss. Shared memory provides a lower average latency than texture.