How are types larger than 4 bytes stored in shared memory, and how does this relate to bank conflicts?

I am learning CUDA programming and feel like I understand bank conflicts and the importance of avoiding them. However, I can’t find anything about how types larger than 4 bytes (e.g. double, cufftComplex) are stored in shared memory. If I have an array of doubles, will each double be stored across two banks? If so, this seems to make avoiding bank conflicts more of a headache. And what about small structs like cufftComplex from the cuFFT library?

Any explanation of how a double is stored in shared memory, and how a warp accesses it, would be much appreciated.

double is represented by the IEEE-754 type binary64, a sequence of eight consecutive bytes. Since the GPU has a little-endian architecture, these bytes are arranged in little-endian storage order.

Complex types are stored as a sequence of eight (single precision) or sixteen (double precision) consecutive bytes. The real part occupies the least significant four or eight bytes and the imaginary part occupies the most significant four or eight bytes, matching the storage order prescribed by the ISO standards for C, C++, and Fortran.

These built-in data types are naturally aligned, meaning their starting address must be an integer multiple of their size.
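If you want to convince yourself, here is a quick compile-time check (my sketch, not from the post; it assumes a typical 64-bit toolchain and that the cuFFT headers are available):

#include <cufft.h>   // cufftComplex, cufftDoubleComplex
#include <cstddef>   // offsetof

// Compile-time checks of the layout and alignment facts stated above.
static_assert(sizeof(double) == 8 && alignof(double) == 8,
              "double: eight naturally aligned bytes");
static_assert(sizeof(cufftComplex) == 8 && sizeof(cufftDoubleComplex) == 16,
              "complex: eight or sixteen consecutive bytes");
static_assert(alignof(cufftDoubleComplex) == 16,
              "naturally aligned: start address is a multiple of the size");
static_assert(offsetof(cufftDoubleComplex, x) == 0 &&
              offsetof(cufftDoubleComplex, y) == 8,
              "real part in the least significant bytes, imaginary part above");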

To first order, I would not concern myself with potential shared-memory bank conflicts caused by using data of these types. Write code in a natural style appropriate for the task at hand, and leave micro-optimizations and ninja-level optimizations for later (potentially deferring them indefinitely). Avoid getting bogged down in minutiae while you are still trying to get a handle on basic CUDA programming concepts.

If other forum participants disagree with this advice, I am sure they will speak up.

I mean, I agree that at first it may not be best to focus on such a minute detail, but this still doesn’t answer the question. Is a double stored in one bank or two? Maybe it depends on the compute capability? Even if I don’t implement a complicated speedup, I still want to know. It is not clear to me why the answer to this question is so difficult to find via a Google search.

Assuming the standard 32-bit banks, an 8-byte value is served by two banks.

For a coalesced load of 32 doubles across a warp, the data will be loaded in two bank-conflict-free transactions of 128 bytes each.
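To illustrate, a minimal single-warp kernel (the kernel and its names are mine) showing the conflict-free case just described:

__global__ void warp_reads_doubles(const double* in, double* out)
{
    __shared__ double s[32];
    int lane = threadIdx.x;        // assume a single warp, blockDim.x == 32

    s[lane] = in[lane];
    __syncthreads();

    // One 64-bit load per lane: served as two 128-byte shared-memory
    // transactions with no bank conflicts.
    out[lane] = s[lane];
}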

Yes; shared memory can transmit at most 32 bits per bank per transaction anyway.

So the presumed 2x bank conflict is resolved by the way the access is performed; it simply takes twice as long.

The GPU has to do some internal reordering: in the first transaction, 16 lanes (threads of the warp) receive their first 4 bytes while the other 16 lanes receive their second 4 bytes, and vice versa in the second transaction. This keeps the data paths at 32 bits, avoids bank conflicts within the two internal transactions, and keeps 64-bit accesses from dropping to 1/4 of the speed (instead of 1/2).

But that is all done transparently. One cannot observe the order, and it is not documented (nor does it need to be).

The supported access widths are 32, 64, and 128 bits, naturally taking 1x, 2x, and 4x as long, respectively. Shorter accesses (such as 16 bits) are simply performed as 32-bit accesses and are just as fast (not twice as fast).

By speed I mean here reads/writes per unit of time (as opposed to bytes per unit of time, which you can also reason about).
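Here is a sketch of the three access widths in one place (kernel and names are mine; the type-punning is purely illustrative):

__global__ void access_widths(float* out)
{
    __shared__ __align__(16) float s[128];
    int i = threadIdx.x;                       // assume blockDim.x == 32

    for (int k = i; k < 128; k += 32)          // fill the array
        s[k] = (float)k;
    __syncthreads();

    float  f = s[i];                                 // 32-bit access: 1x
    double d = reinterpret_cast<double*>(s)[i];      // 64-bit access: 2x
    float4 v = reinterpret_cast<float4*>(s)[i];      // 128-bit access: 4x

    out[i] = f + (float)d + v.x + v.y + v.z + v.w;   // keep the loads alive
}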

So let me see if I am understanding correctly: if I am trying to minimize bank conflicts for a certain algorithm that uses doubles, can I just think of it like the float version of the algorithm and let the GPU take care of the finer details?

If you do 64-bit accesses, think of shared memory as having 16 banks of 64 bits each.

(That is not correct from a hardware perspective, but the effect on programming is like that.)

And as 8 banks of 128 bits each for complex double.
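To make that mental model concrete, hedged helper one-liners (the names are mine) for the effective bank of an element index:

// Effective bank under each mental model; lanes of a warp mapping to the
// same value more often than the unavoidable minimum indicates a conflict.
__device__ int bank32(int idx)  { return idx & 31; } // 32-bit elements: 32 banks
__device__ int bank64(int idx)  { return idx & 15; } // 64-bit elements: "16 banks"
__device__ int bank128(int idx) { return idx & 7;  } // 128-bit elements: "8 banks"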

Alternatively, you can store and access a complex double as two 64-bit doubles. But if you split the access in two, the access pattern has to change (as shown below) to avoid bank conflicts.

I’ll show you six different ways to store and access complex double numbers; choose yours:

// (sketch: the declarations and accesses below live inside a kernel)

typedef struct {
    double r;
    double i;
} complex; // same layout as cufftDoubleComplex (double2)

typedef union {
    double d[64];
    complex c[32];
} either;

__shared__ volatile either e;
__shared__ volatile double d[64];  // interleaved: 0 - real, 1 - imaginary, 2 - real, ...
__shared__ volatile double d2[64]; // split: 0..31 - real, 32..63 - imaginary
__shared__ volatile complex c2[32];
__shared__ volatile double r2[32];
__shared__ volatile double i2[32];

int i = threadIdx.x; // 0..31 for a single warp of 32 threads


// bank conflicts

d[i*2];     // real values: 2-way bank conflict
d[i*2+1];   // imaginary values: 2-way bank conflict
e.d[i*2];   // real values: same conflict
e.d[i*2+1]; // imaginary values: same conflict


// resolution 1

e.c[i]; // okay (assuming the compiler emits a single 128-bit access, which needs 16-byte alignment)
c2[i];  // okay (same caveat)


// resolution 2

e.d[i]; // okay, but the two components of each complex number now sit in 2 different threads! Sometimes that does not matter, e.g. when you just want to scale or add numbers.


// resolution 3

// first read in a different order, then swap back
// the swap is done locally in registers, so it is fast

int odd = (i >> 3) & 1; // alternate in groups of 8 lanes so that each
                        // half-warp touches all 16 64-bit banks per load
int idxA = i * 2 + odd;
int idxB = i * 2 + 1 - odd;

// okay, no bank conflicts!

double dA = e.d[idxA];
double dB = e.d[idxB];
double re = odd ? dB : dA; // real
double im = odd ? dA : dB; // imaginary


// resolution 4

r2[i]; // okay
i2[i]; // okay


// resolution 5

d2[i]; // okay, real
d2[i + 32]; // okay, imaginary
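
As a usage example, here is resolution 4 in a self-contained kernel (the kernel and the scaling task are mine, not from the question):

__global__ void scale_complex(const double2* in, double2* out, double scale)
{
    __shared__ double re[32];
    __shared__ double im[32];
    int i = threadIdx.x;            // assume a single warp, blockDim.x == 32

    double2 v = in[i];              // one 128-bit global load
    re[i] = v.x;                    // conflict-free 64-bit shared stores
    im[i] = v.y;
    __syncthreads();

    // Any stride-1 access to re[] / im[] stays conflict-free.
    double2 r;
    r.x = scale * re[i];
    r.y = scale * im[i];
    out[i] = r;
}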

PS
All recent non-datacenter GPUs are “very slow” (compared to float) at double-precision computation, for reasons of product differentiation and silicon real estate (consumers often do not need it, so why spend chip area on it). Memory accesses of doubles remain fast on those GPUs, however.
