Why don't GPUs have multiple independent memory banks

Uncle_Joe · December 5, 2009, 8:01pm

I just realized NVIDIA GPUs probably only have 1 address bus, which explains why non contiguous (coalesced) accesses need multiple transactions. Why not make 8 or 6 independent banks? The biggest hurdle I can think of is pin count. 8 Address/Data busses would consume 64 * 8 = 512 pins. Does anyone know how many pins the GT200s have and how many are for memory?

LSChien · December 6, 2009, 1:51am

GT200 has 8 independent channels (or called partititons), each channel has 256 byte width and has 64-bit memory interface.

the best performance occurs when access-patterns of all SMs are evenly distributed among 8 channels.

Uncle_Joe · December 6, 2009, 4:08am

OK, after some analysis, I’ve figured 8 parallel banks doesn’t buy you much flexibility. See memory_banks.png.

8 parallel banks still will suffer more conflicts the bigger the access stride. Although accessing 0, 9, 18, 27, 36, 45, 54, 63 are to independent banks, practically no application will use this pattern, so they probably didn’t bother offering this flexibility and insisted that all accesses fall in same segment. So in the end, 8 || banks is a mirage - it doesn’t help much besides stride 1 accesses.

1 proposal would be to have 32 banks. Then even 8 stride 4 accesses can all map to different banks. CRAYs had hundreds of memory banks, which I guess help strided accesses. But what separates commodity clusters from super computers is the better interconnect.

This still doesn’t answer if GPUs have a single or multiple address buses. I still suspect all banks use the same address, but each has its own data bus?

LSChien · December 6, 2009, 8:14am

what’s your test method? Could you post it?

I think that programmer can control thread blocks to access different channel,

like diagonal-ordering technique to solve partition camping,

you can see document in SDK/transposeNew or thread

http://forums.nvidia.com/index.php?showtop…rt=#entry590852

Uncle_Joe · December 6, 2009, 9:28am

I didn’t test stride 9 access (stride 18 if you consider 4 byte words) myself. I just guessed based on my computer architecture knowledge and the engineer’s good sense. The CUDA manual specifically says a necessary condition

for coalesced read/write is for all addresses in a half warp to belong to the same 16, 32, 64, or 128 byte memory segment.

Stride 9 accesses won’t fall into 1 segment, so hardware won’t coalesce. However, the address space partitioning I showed does allow such || access.

Jimmy_Pettersson · December 6, 2009, 5:24pm

Diagonal reordering or padding can solve this issue.

It’s analoguos when working with a matrix in shared memory…

Switch

__shared__ float tile[16][16];

to

__shared__ float tile[16][17];

Uncle_Joe · December 7, 2009, 8:04pm

OK my mistake, GPUs do use multiple independent banks, according to the SDK’s TransposeNew, which I never saw before. Thanks for pointing that out.

Earlier, I thought a coalesced memory op accesses all banks in || with same address. In retrospect, this doesn’t make sense because the memory bus width is usually > coalesced read/write size, resulting in wasted bandwidth.

Since every contiguous 256 bytes are in same partition and assuming each 256 byte word is handled by 1 memory bank, that means each multiprocessor can only access 1 memory bank at a time, hence if you only have 1 thread block, you only achieve 1/#banks of peak bandwidth. I verified this with a benchmark - I only get ~13GiB/s (why > theoretical bound of 95.4 / 8 = 11.9 GiB/s ?) as expected.

Topic		Replies	Views
Cuda Memory Bank layout Interleaving, Addressing, Conflicts CUDA Programming and Performance	25	61677	September 4, 2008
How to find number of banks in GPU global memory? CUDA Programming and Performance	4	3515	June 12, 2016
How to understand the bank conflict of shared_mem CUDA Programming and Performance	16	14421	November 19, 2025
Requesting clarification for Shared Memory Bank Conflicts and Shared memory access? CUDA Programming and Performance hw , cuda	11	5013	January 23, 2024
Memory coalescing in one thread CUDA Programming and Performance	17	16779	March 31, 2011
Memory transaction size and coalesced access CUDA Programming and Performance	6	4956	November 12, 2008
Partition camping ? CUDA Programming and Performance	5	11148	March 23, 2010
How do multicore access shared memory at same time? CUDA Programming and Performance	1	5835	January 15, 2009
Trade-off Between Bank Conflict and Thread Count in Shared Memory Access CUDA Programming and Performance cuda	9	193	June 23, 2025
How to understand memory access? CUDA Programming and Performance	1	706	October 16, 2023

Why don't GPUs have multiple *independent* memory banks

Related topics

Why don't GPUs have multiple independent memory banks