For the A100, will the unallocated part of the L1 cache perform exactly like shared memory? For example, will it also cause bank conflicts?
If I have some data that needs to be loaded and processed several times by the tensor cores, and there is no way to prevent bank conflicts if I store it in shared memory, would it be a better choice to just load it from global memory (as it will be cached in L1, if there are no bank conflicts there)?
You have at least four parts:
- allocated shared memory: dynamically and statically sized
- unallocated shared memory: unused due to the fixed sizes available for allocation
- a small part used by Nvidia, depending on the number of blocks
- L1 cache
The L1 cache does not have bank conflicts. It serves accesses into global memory, which follow even stricter coalescing rules.
If you load the same data in the same way from shared memory, and you have bank conflicts, then you would have repeated bank conflicts.
If you have coalescing violations when loading from global memory, you have repeated coalescing violations.
Why not just keep the loaded data in registers?
Or store the data back to shared memory with a better arrangement?
In the end, try out the variants and benchmark which works best in your case.
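For example, if each thread reuses the same value several times, a minimal sketch of the register option could look like the following (the names and the plain FMA loop are made up for illustration; a real tensor core path would use MMA fragments instead):

```cuda
// Hypothetical sketch: load each thread's value from global memory once
// (it goes through L1), keep it in a register and reuse it several times.
__global__ void reuse_in_registers(const float* __restrict__ in,
                                   const float* __restrict__ coeffs,
                                   float* __restrict__ out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float v = in[idx];      // single global load per thread
    float acc = 0.0f;

    #pragma unroll
    for (int s = 0; s < 8; ++s)
        acc += v * coeffs[s];   // repeated use comes from the register, not memory

    out[idx] = acc;
}
```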
There are access patterns with which the L1 cache works best:
- Try to load aligned 32-byte blocks overall at the same time (the loads can be split across the threads)
- That means a maximum of 4 such 32-byte blocks for 32-bit accesses per thread
- You can have 8 or 16 32-byte blocks for 64-bit or 128-bit accesses per thread
That means with shared memory you have 32 independent addresses (apart from having to care about bank conflicts), while with L1 you have 4 independent addresses (or 8/16 for 64- or 128-bit loads).
However, you can choose which part of the 32 bytes goes into which lane (as long as you keep it aligned to 32-128 bits) or even use broadcasts.
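As a rough illustration (my own example, with made-up kernel and array names): a warp whose 4-byte loads all fall into a few aligned 32-byte sectors can be served together, while the same amount of data spread over many sectors cannot.

```cuda
// Illustration only: both kernels read 32 floats per warp, but the first keeps
// all addresses within 4 aligned 32-byte sectors, while the second spreads the
// 32 addresses over 32 different sectors.
__global__ void packed_loads(const float* __restrict__ in, float* __restrict__ out)
{
    int lane = threadIdx.x & 31;
    // Consecutive 4-byte words -> the warp touches only 4 aligned 32-byte sectors.
    out[threadIdx.x] = in[lane];
}

__global__ void scattered_loads(const float* __restrict__ in, float* __restrict__ out)
{
    int lane = threadIdx.x & 31;
    // Stride of 32 floats (128 bytes) -> every lane hits a different sector.
    out[threadIdx.x] = in[lane * 32];
}
```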
So does that mean it will be faster to load data from global memory (if I know the data can stay in the L1 cache) than to load it from shared memory with bank conflicts (for example, 2-way)?
As said, it depends. Shared memory has 32 independent addresses, L1 only 4 (or 8/16 with 64/128 bit loads).
Think of the L1 cache as outputting 4 aligned 32-byte blocks per cycle, with each lane able to pick any aligned 32-bit word from them.
64- and 128-bit accesses take 2 or 4 cycles, output 8 or 16 blocks, and each lane can pick any 64- or 128-bit aligned value from them.
We are talking about the original addresses in global memory: you do not get better coalescing rules just because the data is cached.
Whereas in shared memory, you can store the data for repeated access in any layout you want (as long as it follows the bank rules).
You may even be marginally worse off.
Looking at the latency table on p. 22, the L1 hit latency is slightly higher than the conflict-free shared memory latency for the architectures measured. Even with minor conflicts, p. 33 shows shared memory is faster.
At least the bandwidth seems to be the same: L1 Cache Effective Bandwidth - #3 by Greg
Could it be 4 bytes per load/store unit (aka LSU or LD/ST in the SM diagrams with the cores) per cycle?
Really interesting! So under some conditions, like sparse compute, where bank conflicts can't be eliminated, allocating very little shared memory and loading the data through the L1 cache can be the better choice. Do you know of any paper that makes good use of the L1 cache?
I would think the situations where L1 is really advantageous performance-wise for repeated accesses are very few. E.g. for sparse compute (in the sense of randomly distributed elements), the requirement of at most 4 aligned 32-byte blocks is quite limiting.
With shared memory you have to fill it from global memory once and can then store the data in an optimal layout, which avoids bank conflicts.
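A minimal sketch of that idea is the classic padded transpose tile (not specific to your kernel; padding the row to 33 floats is the standard trick to shift each row by one bank):

```cuda
// Fill shared memory once from global memory, but store it with one float of
// padding per row, so that the later column-wise reads are conflict-free.
// Assumes a square matrix with side a multiple of 32 and blockDim = (32, 32, 1).
__global__ void transpose_tile(const float* __restrict__ in, float* __restrict__ out)
{
    __shared__ float tile[32][33];            // 33 instead of 32: +1 padding per row

    int width = gridDim.x * 32;
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // one coalesced fill
    __syncthreads();

    int tx = blockIdx.y * 32 + threadIdx.x;                 // transposed block origin
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column read, no conflicts
}
```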
Reasons one would use L1 instead:
- simplicity: for quick implementations, or to keep CPU/GPU code the same, when it does not matter for performance
- repeating, but not predetermined, accesses (or complicated access patterns): the caching feature of the L1 cache
- accesses that are overall random, but related at least between some of the threads
- dynamic write-backs, where cache coherency matters
So if different thread blocks need very different input, can that also harm the efficiency of the L1 cache? Would it be a better choice to manually schedule the thread blocks by their SM's serial number to make better use of the L1 cache?
It depends on how much you rely on the caching feature and on your working set size relative to the cache size. You can also use more threads per block (to have fewer blocks per SM) or limit the blocks per SM in any other way.
If there is a closer relationship between some blocks, so that you would want to put them onto the same SM: why not combine those blocks into one actual thread block? Use threadIdx.z to differentiate between those sub-blocks.
Actually reading out the SM id should be rarely needed.
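A minimal sketch of that fusion idea (sizes and names are made up): two logical blocks that share the same input are launched as one thread block and distinguished by threadIdx.z, so the common data is staged into shared memory only once.

```cuda
// Sketch: fuse two related logical blocks into one launched block.
// Launch with blockDim = (128, 1, 2); threadIdx.z selects the logical sub-block.
__global__ void fused_blocks(const float* __restrict__ common,
                             const float* __restrict__ per_block,
                             float* __restrict__ out)
{
    __shared__ float s_common[128];

    // Only sub-block 0 stages the shared input; both sub-blocks reuse it.
    if (threadIdx.z == 0)
        s_common[threadIdx.x] = common[blockIdx.x * 128 + threadIdx.x];
    __syncthreads();

    int logical_block = blockIdx.x * 2 + threadIdx.z;   // formerly separate blocks
    int idx = logical_block * 128 + threadIdx.x;
    out[idx] = s_common[threadIdx.x] * per_block[idx];
}
```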