I just realized NVIDIA GPUs probably have only one address bus, which explains why non-contiguous (non-coalesced) accesses need multiple transactions. Why not make 8 or 6 independent banks? The biggest hurdle I can think of is pin count: 8 address/data buses would consume 64 * 8 = 512 pins. Does anyone know how many pins the GT200s have, and how many of them are for memory?
GT200 has 8 independent channels (also called partitions); each channel is 256 bytes wide (the interleaving granularity) and has a 64-bit memory interface.
The best performance occurs when the access patterns of all SMs are evenly distributed among the 8 channels.
OK, after some analysis, I’ve figured out that 8 parallel banks don’t buy you much flexibility. See memory_banks.png.
8 parallel banks will still suffer more conflicts the bigger the access stride. Although accesses to words 0, 9, 18, 27, 36, 45, 54, and 63 go to independent banks, practically no application uses that pattern, so they probably didn’t bother offering this flexibility and instead insisted that all accesses fall in the same segment. So in the end, 8 parallel banks are a mirage - they don’t help much beyond stride-1 accesses.
One proposal would be to have 32 banks. Then even 8 stride-4 accesses can all map to different banks. CRAYs had hundreds of memory banks, which I guess helped strided accesses. But what separates commodity clusters from supercomputers is the better interconnect.
This still doesn’t answer whether GPUs have a single address bus or multiple ones. I still suspect all banks share the same address, but each has its own data bus?
What’s your test method? Could you post it?
I think the programmer can control which channel the thread blocks access,
for example with the diagonal-ordering technique used to solve partition camping;
you can see the document in SDK/transposeNew or the thread
I didn’t test stride-9 access (stride 18 if you count in 4-byte words) myself. I just guessed based on my computer architecture knowledge and an engineer’s good sense. The CUDA manual specifically says a necessary condition
for a coalesced read/write is that all addresses in a half warp belong to the same 16-, 32-, 64-, or 128-byte memory segment.
Stride-9 accesses won’t fall into one segment, so the hardware won’t coalesce them. However, the address-space partitioning I showed does allow such parallel access.
Diagonal reordering or padding can solve this issue.
It’s analogous to working with a matrix in shared memory:

__shared__ float tile[TILE_DIM][TILE_DIM];      // columns hit the same bank
__shared__ float tile[TILE_DIM][TILE_DIM+1];    // padding by one avoids bank conflicts
OK, my mistake: GPUs do use multiple independent banks, according to the SDK’s transposeNew example, which I had never seen before. Thanks for pointing that out.
Earlier, I thought a coalesced memory op accessed all banks in parallel with the same address. In retrospect, this doesn’t make sense, because the memory bus width is usually greater than the coalesced read/write size, which would waste bandwidth.
Since every contiguous 256 bytes are in the same partition, and assuming each 256-byte segment is handled by one memory bank, each multiprocessor can only access one memory bank at a time. Hence, if you only have one thread block, you only achieve 1/#banks of peak bandwidth. I verified this with a benchmark: I get only ~13 GiB/s as expected (though why is it above the theoretical bound of 95.4 / 8 = 11.9 GiB/s?).