How are types larger than 4 bytes stored in shared memory, and how does this relate to bank conflicts?

I am learning CUDA programming and feel like I understand bank conflicts and the importance of avoiding them. However, I can’t find anything about how types larger than 4 bytes (e.g. double, cufftComplex) are stored in shared memory. If I have an array of doubles, will each double be stored across two banks? If so, this seems to make avoiding bank conflicts more of a headache. And what about small structs like cufftComplex from the cuFFT library?

Any explanation of how a double is stored in shared memory, and how a warp accesses it, would be much appreciated.

double is represented by the IEEE-754 type binary64, a sequence of eight consecutive bytes. Since the GPU has a little-endian architecture, these bytes are arranged in little-endian storage order.

Complex types are stored as a sequence of eight (single precision) or sixteen (double precision) consecutive bytes. The real part occupies the least significant four or eight bytes and the imaginary part occupies the most significant four or eight bytes, matching the storage order prescribed by the ISO standards for C, C++, and Fortran.

These built-in data types are naturally aligned, meaning their starting address must be an integer multiple of their size.
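If you want to convince yourself, here is a quick compile-time check (my sketch, not from the post; it assumes a typical 64-bit toolchain and that the cuFFT headers are available):

#include <cufft.h>   // cufftComplex, cufftDoubleComplex
#include <cstddef>   // offsetof

// Compile-time checks of the layout and alignment facts stated above.
static_assert(sizeof(double) == 8 && alignof(double) == 8,
              "double: eight naturally aligned bytes");
static_assert(sizeof(cufftComplex) == 8 && sizeof(cufftDoubleComplex) == 16,
              "complex: eight or sixteen consecutive bytes");
static_assert(alignof(cufftDoubleComplex) == 16,
              "naturally aligned: start address is a multiple of the size");
static_assert(offsetof(cufftDoubleComplex, x) == 0 &&
              offsetof(cufftDoubleComplex, y) == 8,
              "real part in the least significant bytes, imaginary part above");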

To first order, I would not concern myself with potential shared-memory bank conflicts caused by using data of these types. Write code in a natural style appropriate for the task at hand, and leave micro-optimizations and ninja-level optimizations for later (potentially deferring them indefinitely). Avoid getting bogged down in minutiae while you are still trying to get a handle on basic CUDA programming concepts.

If other forum participants disagree with this advice, I am sure they will speak up.

I mean, I agree that at first it may not be best to focus on such a minute detail, but this still doesn’t answer the question. Is a double stored in one bank or two? Maybe it depends on the compute capability? Even if I don’t implement a complicated speedup, I still want to know. It is not clear to me why the answer to this question is so difficult to find via a Google search.

Assuming the standard 32-bit banks, an 8-byte value is served by two banks.

For a coalesced load of 32 doubles across a warp, the data will be loaded in two bank-conflict-free transactions of 128 bytes each.
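To illustrate, a minimal single-warp kernel (the kernel and its names are mine) showing the conflict-free case just described:

__global__ void warp_reads_doubles(const double* in, double* out)
{
    __shared__ double s[32];
    int lane = threadIdx.x;        // assume a single warp, blockDim.x == 32

    s[lane] = in[lane];
    __syncthreads();

    // One 64-bit load per lane: served as two 128-byte shared-memory
    // transactions with no bank conflicts.
    out[lane] = s[lane];
}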

Yes; shared memory can transmit at most 32 bits per bank per transaction anyway.

So the presumed 2x bank conflict is resolved by the way the access is performed; it simply takes twice as long.

The GPU has to do some internal reordering: in the first transaction, 16 lanes (threads of the warp) receive their first 4 bytes while the other 16 lanes receive their second 4 bytes, and vice versa in the second transaction. This keeps the data paths at 32 bits, avoids bank conflicts within the two internal transactions, and keeps 64-bit accesses from dropping to 1/4 of the speed (instead of 1/2).

But that is all done transparently. One cannot observe the order, and it is not documented (nor does it need to be).

The supported access widths are 32, 64, and 128 bits, naturally taking 1x, 2x, and 4x as long, respectively. Shorter accesses (such as 16 bits) are simply performed as 32-bit accesses and are just as fast (not twice as fast).

By speed I mean here reads/writes per unit of time (as opposed to bytes per unit of time, which you can also reason about).
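Here is a sketch of the three access widths in one place (kernel and names are mine; the type-punning is purely illustrative):

__global__ void access_widths(float* out)
{
    __shared__ __align__(16) float s[128];
    int i = threadIdx.x;                       // assume blockDim.x == 32

    for (int k = i; k < 128; k += 32)          // fill the array
        s[k] = (float)k;
    __syncthreads();

    float  f = s[i];                                 // 32-bit access: 1x
    double d = reinterpret_cast<double*>(s)[i];      // 64-bit access: 2x
    float4 v = reinterpret_cast<float4*>(s)[i];      // 128-bit access: 4x

    out[i] = f + (float)d + v.x + v.y + v.z + v.w;   // keep the loads alive
}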

So let me see if I am understanding correctly: if I am trying to minimize bank conflicts for a certain algorithm that uses doubles, can I just think of it like the float version of the algorithm and let the GPU take care of the finer details?

If you do 64-bit accesses, think of shared memory as having 16 banks of 64 bits each.

(That is not correct from a hardware perspective, but the effect on programming is like that.)

And as 8 banks of 128 bits each for complex double.
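To make that mental model concrete, hedged helper one-liners (the names are mine) for the effective bank of an element index:

// Effective bank under each mental model; lanes of a warp mapping to the
// same value more often than the unavoidable minimum indicates a conflict.
__device__ int bank32(int idx)  { return idx & 31; } // 32-bit elements: 32 banks
__device__ int bank64(int idx)  { return idx & 15; } // 64-bit elements: "16 banks"
__device__ int bank128(int idx) { return idx & 7;  } // 128-bit elements: "8 banks"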

Alternatively, you can store and access a complex double as two 64-bit doubles. But if you split the access in two, the access pattern has to change (as shown below) to avoid bank conflicts.

I’ll show you six different ways to store and access complex double numbers; choose yours:

// (sketch: the declarations and accesses below live inside a kernel)

typedef struct {
    double r;
    double i;
} complex; // same layout as cufftDoubleComplex (double2)

typedef union {
    double d[64];
    complex c[32];
} either;

__shared__ volatile either e;
__shared__ volatile double d[64];  // interleaved: 0 - real, 1 - imaginary, 2 - real, ...
__shared__ volatile double d2[64]; // split: 0..31 - real, 32..63 - imaginary
__shared__ volatile complex c2[32];
__shared__ volatile double r2[32];
__shared__ volatile double i2[32];

int i = threadIdx.x; // 0..31 for a single warp of 32 threads


// bank conflicts

d[i*2];     // real values: 2-way bank conflict
d[i*2+1];   // imaginary values: 2-way bank conflict
e.d[i*2];   // real values: same conflict
e.d[i*2+1]; // imaginary values: same conflict


// resolution 1

e.c[i]; // okay (assuming the compiler emits a single 128-bit access, which needs 16-byte alignment)
c2[i];  // okay (same caveat)


// resolution 2

e.d[i]; // okay, but the two components of each complex number now sit in 2 different threads! Sometimes that does not matter, e.g. when you just want to scale or add numbers.


// resolution 3

// first read in a different order, then swap back
// the swap is done locally in registers, so it is fast

int odd = (i >> 3) & 1; // alternate in groups of 8 lanes so that each
                        // half-warp touches all 16 64-bit banks per load
int idxA = i * 2 + odd;
int idxB = i * 2 + 1 - odd;

// okay, no bank conflicts!

double dA = e.d[idxA];
double dB = e.d[idxB];
double re = odd ? dB : dA; // real
double im = odd ? dA : dB; // imaginary


// resolution 4

r2[i]; // okay
i2[i]; // okay


// resolution 5

d2[i]; // okay, real
d2[i + 32]; // okay, imaginary
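
As a usage example, here is resolution 4 in a self-contained kernel (the kernel and the scaling task are mine, not from the question):

__global__ void scale_complex(const double2* in, double2* out, double scale)
{
    __shared__ double re[32];
    __shared__ double im[32];
    int i = threadIdx.x;            // assume a single warp, blockDim.x == 32

    double2 v = in[i];              // one 128-bit global load
    re[i] = v.x;                    // conflict-free 64-bit shared stores
    im[i] = v.y;
    __syncthreads();

    // Any stride-1 access to re[] / im[] stays conflict-free.
    double2 r;
    r.x = scale * re[i];
    r.y = scale * im[i];
    out[i] = r;
}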

PS
All recent non-datacenter GPUs are “very slow” (compared to float) at double-precision computation, for reasons of product differentiation and silicon real estate (consumers often do not need it, so why spend chip area on it). Memory accesses of doubles remain fast on those GPUs, however.
