I am working with an RTX 2080 Max-Q (mobile), compute capability 7.5. I am trying to understand why I cannot achieve the documented 32 bits/cycle/bank shared memory bandwidth:
Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
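A small sketch of that mapping rule (my own illustration, assuming the four-byte bank mode):

#include <cstdint>

// Successive 32-bit words map to successive banks:
// bank = (byte offset / 4) % 32
constexpr int bank_of(std::uintptr_t byte_offset)
{
    return static_cast<int>((byte_offset / 4) % 32);
}
// e.g. offsets 0, 4, 8, ... hit banks 0, 1, 2, ...; offset 128 wraps back to bank 0.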
I’ve written this test code:
#include <cstdio>
#include <algorithm>
#include <numeric>

using T = float;
extern __shared__ T bank[];

constexpr int warps = 8;
constexpr int pitch = 32 * warps;
constexpr int size = 32;

__managed__ long long starts[pitch];
__managed__ long long stops[pitch];
__managed__ long long clocks[pitch];

__global__ void kernel()
{
    // pitch is a multiple of 32, so every access of a thread stays in the
    // same bank and a warp's 32 accesses are conflict-free.
    auto* local_bank = bank + threadIdx.x;
    auto* a = local_bank;
    auto* b = a + size * pitch;

    __syncwarp();
    auto start = clock64();
    __syncwarp();

    for (int i = 0; i < size; i++)
        b[i * pitch] = a[i * pitch];

    __syncwarp();
    auto stop = clock64();
    __syncwarp();

    auto duration = stop - start;
    printf("%5lld %s", duration, threadIdx.x % 32 == 31 ? "\n" : "");
    __syncthreads();

    starts[threadIdx.x + blockDim.x * blockIdx.x] = start;
    stops[threadIdx.x + blockDim.x * blockIdx.x] = stop;
    clocks[threadIdx.x + blockDim.x * blockIdx.x] = duration;
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);

    kernel<<<1, pitch, 2 * size * pitch * sizeof(T)>>>();
    cudaDeviceSynchronize();

    auto min_clock = *std::min_element(std::begin(clocks), std::end(clocks));
    auto min_start = *std::min_element(std::begin(starts), std::end(starts));
    auto max_stop = *std::max_element(std::begin(stops), std::end(stops));
    auto avg_clock = std::accumulate(std::begin(clocks), std::end(clocks), 0LL)
                   / (float)(std::end(clocks) - std::begin(clocks));
    auto avg_clock_per_access = avg_clock / (float)(warps * size * 2);

    printf("min = %lld\n", min_clock);
    printf("max = %lld\n", max_stop - min_start);
    printf("avg = %f\n", avg_clock_per_access);
}
It measures, in clock cycles, the duration of the shared memory copy for each thread and prints the average number of cycles per shared memory access. I get ~2 cycles, but it should be 1 cycle, and I don’t understand what I’m doing wrong.
Note: I have tried with float4 and got the same results. You can change the type T to float4 and reduce size to 8 so that you don’t use more than 64 KiB of shared memory.
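For concreteness, only these two lines change for the float4 variant (the allocation 2 * size * pitch * sizeof(T) then stays at exactly 64 KiB):

using T = float4;        // 16-byte elements instead of 4-byte floats
constexpr int size = 8;  // 2 * 8 * 256 * 16 B = 64 KiB of dynamic shared memory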
I encounter similar behaviour on my RTX 3060. When I execute your code, I get 2 cycles per access (=> 2 bytes per cycle). However, if I change to float4, it takes 4 cycles per access (=> 4 bytes per cycle). For float2, I get 2 cycles as well (=> 4 bytes per cycle).
@Robert_Crovella I played around with the warp/size parameters as well (keeping their product constant to avoid changing the amount of shared memory) and did not observe any changes in behaviour.
I also queried the bank size via cudaDeviceGetSharedMemConfig and it returned cudaSharedMemBankSizeFourByte.
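For reference, a minimal host-side query (using cudaDeviceGetSharedMemConfig, the getter counterpart of cudaDeviceSetSharedMemConfig):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    std::printf(cfg == cudaSharedMemBankSizeFourByte ? "four-byte banks\n" : "other\n");
}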
With 2 warps and float4 I get ~4.33 clocks average, so that is ~32 bits per bank per clock. With 4 warps and float2 I get ~2.09 clocks average, so that is also ~32 bits per bank per clock.
I certainly observe a change in behavior going from 2 warps to 1 warp. So I would say it also seems to need 2+ warps to get max throughput.
(L4 GPU, CUDA 12.2)
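To spell out the arithmetic: a warp-wide float4 access moves 32 × 16 B = 512 B, and the 32 banks can deliver 32 × 4 B = 128 B per clock, so the conflict-free minimum is 512 / 128 = 4 clocks; the measured ~4.33 is close to that. Likewise a warp-wide float2 access moves 32 × 8 B = 256 B, giving a 2-clock minimum vs. the measured ~2.09.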
All GPUs newer than Kepler will return cudaSharedMemBankSizeFourByte.
You can play around with the knobs to get slightly better measurements. For example with float2, 8 warps, and size of 16, I get ~2.03 clocks average.
I’ve modified the code to report the results in bytes/cycle and managed to get the code executed by the same thread a number of times (the loop was getting optimized out; __syncwarp() turned out to be the solution).
#include <cstdio>

using T = float;
extern __shared__ T bank[];

constexpr int warps = 32;
constexpr int pitch = 32 * warps;
constexpr int size = 16 / (sizeof(T) / sizeof(float)); // keep bytes per thread constant across T

__global__ void kernel()
{
    auto* local_bank = bank + threadIdx.x;
    auto* a = local_bank;

    __syncwarp();
    auto start = clock64();
    __syncwarp();

    int iterations = 10000;
    for (int iteration = 0; iteration < iterations; iteration++)
    {
        for (int i = 0; i < size; i++)
            a[i * pitch] = T{};
        __syncwarp(); // keeps the compiler from optimizing the stores away
    }

    __syncwarp();
    auto stop = clock64();
    __syncwarp();

    auto duration = (float)(stop - start) / (float)iterations;
    // bytes stored per cycle by one warp
    auto bytes_per_cycle = (float)(size * pitch * sizeof(T)) / duration / warps;
    printf("%7.1f %s", bytes_per_cycle, threadIdx.x % 32 == 31 ? "\n" : "");
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);

    kernel<<<1, pitch, size * pitch * sizeof(T)>>>();
    cudaDeviceSynchronize();
}
For float I get 2 bytes/cycle. I only get 4 bytes/cycle for float2 and float4, which is weird. Another weird thing: if I run the code below for float and float2 and time it with Nsight Compute, I get the same time in both cases: 16.57 ms. I would expect float2 to take 8 ms, since the example above showed that a block can reach 4 bytes per cycle, but there is no difference. @Robert_Crovella do you happen to know why?
using T = float;
extern __shared__ T bank[];

constexpr int warps = 32;
constexpr int pitch = 32 * warps;
constexpr int size = 16 / (sizeof(T) / sizeof(float));

__global__ void kernel()
{
    auto* local_bank = bank + threadIdx.x;
    auto* a = local_bank;

    int iterations = 1000;
    for (int iteration = 0; iteration < iterations; iteration++)
    {
        for (int i = 0; i < size; i++)
            a[i * pitch] = T{};
        __syncwarp();
    }
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);

    kernel<<<46 * 16, pitch, size * pitch * sizeof(T)>>>();
    cudaDeviceSynchronize();
}
If you compute the time this should take at 4 bytes/cycle you get 8 ms, not 16 ms. It seems the device falls back to 2 bytes/cycle with multiple blocks.
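Working that estimate out (assuming the RTX 2080's 46 SMs and a core clock near 1 GHz): each block needs 16 × 1024 × 4 B = 64 KiB of shared memory, so only one block fits per SM and the 46 × 16 blocks run in 16 waves. Per iteration a warp stores 16 × 32 × 4 B = 2 KiB; at 2 B/cycle/warp that is 1024 cycles, so 1000 iterations × 16 waves ≈ 16.4 M cycles ≈ 16 ms at 1 GHz, which matches the measurement. At 4 B/cycle/warp the same store count would finish in ≈ 8 ms.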
I did some additional experiments and ran across three more oddities:
If I double the pitch to “64 * warps” and reduce size to 16, I encounter another performance penalty of a factor of 2 (i.e., it takes 4 clock cycles to load/store 4 bytes instead of 2). Doubling the pitch should only use every other cell in each bank, but not generate any bank conflicts.
If I change local_bank to “bank + (threadIdx.x & ~1) + (threadIdx.x & 1) * pitch” (i.e., each thread has a bank conflict with one of its direct neighbors) and let the loop run for one less element (to avoid an invalid access), I still get 2 cycles. If I do the same for float2, I get 4 cycles! If I make it a four-way conflict (change “1” to “3” and skip the last 3 iterations) I get a bit less than twice these numbers (3.5/7.4; I did compensate for the fewer iterations in the calculation).
However, if I change local_bank to “bank + ((threadIdx.x >> 4) & 1) * pitch + (threadIdx.x & 15);” (i.e., each pair of threads at distance 16 within a warp should have a bank conflict) instead (and do one less iteration), I get 2 cycles for float AND float2. Making this a four-way conflict (shift by 3 instead of 4 and use 7 instead of 15 as a mask; skip last three iterations) I get 2 cycles (float) and 4 cycles (float2) again.
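For reference, the address computations from these three experiments written out as code (one variant at a time, copied from the descriptions above):

// 1: double the pitch, so only every other row of each bank is used
constexpr int pitch = 64 * warps;

// 2: 2-way conflict between direct neighbors (run the loop for one less element)
auto* local_bank = bank + (threadIdx.x & ~1) + (threadIdx.x & 1) * pitch;

// 3: 2-way conflict between threads at distance 16 within a warp
auto* local_bank = bank + ((threadIdx.x >> 4) & 1) * pitch + (threadIdx.x & 15);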
No, I don’t. I haven’t studied it much. The thread started out seeming to suggest that 32 bits per bank could not be achieved based on using clock64 measurement. From my perspective, we have now established that it can be, based on clock64 measurement.
TU10x, GA10x, and AD10x chips have an SM shared memory (and shuffle) throughput of 16 threads/cycle (vs. 32 threads/cycle on GV100, GA100, and GH100). The return bandwidth is 128 B/cycle. Reads of <= 32 bits achieve only 1/2 throughput; reads of 64 and 128 bits can achieve the full 128 B/cycle return.
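To illustrate that point (my own sketch, not code from this thread): widening the per-thread access to 64 or 128 bits is what lets these chips reach the full return bandwidth.

__global__ void wide_copy(float4* out)
{
    extern __shared__ float4 smem[];  // 128-bit elements
    // One 128-bit access per thread: the warp requests 512 B, which can be
    // returned in 4 clocks at 128 B/cycle; 32-bit accesses would run at
    // half that throughput on TU10x/GA10x/AD10x.
    out[threadIdx.x] = smem[threadIdx.x];
}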
Thank you very much for the clarification. Things make a lot more sense now. Maybe the corresponding entries in the programming guide could be updated accordingly (e.g., all devices of compute capability 8.6 have 16 threads/cycle according to your information).
Until then, since my project requires a throughput of 32 threads/cycle: what is the oldest/most basic (to be honest, most inexpensive, since I cannot afford an A100 for this project) chip that actually reaches 32 threads/cycle (according to the aforementioned table, all chips except compute capability 7.5…)?