I am working with an RTX 2080 Max-Q (mobile), compute capability 7.5. I am trying to understand why I cannot achieve the documented 32 bits/cycle/bank shared memory bandwidth:
Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
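A small sketch of that mapping rule (my own illustration, assuming the four-byte bank mode):

#include <cstdint>

// Successive 32-bit words map to successive banks:
// bank = (byte offset / 4) % 32
constexpr int bank_of(std::uintptr_t byte_offset)
{
    return static_cast<int>((byte_offset / 4) % 32);
}
// e.g. offsets 0, 4, 8, ... hit banks 0, 1, 2, ...; offset 128 wraps back to bank 0.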
I’ve written this test code:
#include <cstdio>
#include <algorithm>
#include <numeric>

using T = float;
extern __shared__ T bank[];

constexpr int warps = 8;
constexpr int pitch = 32 * warps;
constexpr int size = 32;

__managed__ long long starts[pitch];
__managed__ long long stops[pitch];
__managed__ long long clocks[pitch];

__global__ void kernel()
{
    // pitch is a multiple of 32, so every access of a thread stays in the
    // same bank and a warp's 32 accesses are conflict-free.
    auto* local_bank = bank + threadIdx.x;
    auto* a = local_bank;
    auto* b = a + size * pitch;

    __syncwarp();
    auto start = clock64();
    __syncwarp();

    for (int i = 0; i < size; i++)
        b[i * pitch] = a[i * pitch];

    __syncwarp();
    auto stop = clock64();
    __syncwarp();

    auto duration = stop - start;
    printf("%5lld %s", duration, threadIdx.x % 32 == 31 ? "\n" : "");
    __syncthreads();

    starts[threadIdx.x + blockDim.x * blockIdx.x] = start;
    stops[threadIdx.x + blockDim.x * blockIdx.x] = stop;
    clocks[threadIdx.x + blockDim.x * blockIdx.x] = duration;
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);

    kernel<<<1, pitch, 2 * size * pitch * sizeof(T)>>>();
    cudaDeviceSynchronize();

    auto min_clock = *std::min_element(std::begin(clocks), std::end(clocks));
    auto min_start = *std::min_element(std::begin(starts), std::end(starts));
    auto max_stop = *std::max_element(std::begin(stops), std::end(stops));
    auto avg_clock = std::accumulate(std::begin(clocks), std::end(clocks), 0LL)
                   / (float)(std::end(clocks) - std::begin(clocks));
    auto avg_clock_per_access = avg_clock / (float)(warps * size * 2);

    printf("min = %lld\n", min_clock);
    printf("max = %lld\n", max_stop - min_start);
    printf("avg = %f\n", avg_clock_per_access);
}
It measures, in clock cycles, the duration of the shared memory copy for each thread and prints the average number of cycles per shared memory access. I get ~2 cycles, but it should be 1 cycle, and I don’t understand what I’m doing wrong.
Note: I have tried with float4 and got the same results. You can change the type T to float4 and reduce size to 8 so that you don’t use more than 64 KiB of shared memory.
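For concreteness, only these two lines change for the float4 variant (the allocation 2 * size * pitch * sizeof(T) then stays at exactly 64 KiB):

using T = float4;        // 16-byte elements instead of 4-byte floats
constexpr int size = 8;  // 2 * 8 * 256 * 16 B = 64 KiB of dynamic shared memory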
I encounter similar behaviour on my RTX 3060. When I execute your code, I get 2 cycles per access (=> 2 bytes per cycle). However, if I change to float4, it takes 4 cycles per access (=> 4 bytes per cycle). For float2, I get 2 cycles as well (=> 4 bytes per cycle).
@Robert_Crovella I played around with the warp/size parameters as well (keeping their product constant to avoid changing the amount of shared memory) and did not observe any changes in behaviour.
I also queried the bank size via cudaDeviceGetSharedMemConfig and it returned cudaSharedMemBankSizeFourByte.
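For reference, a minimal host-side query (using cudaDeviceGetSharedMemConfig, the getter counterpart of cudaDeviceSetSharedMemConfig):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    std::printf(cfg == cudaSharedMemBankSizeFourByte ? "four-byte banks\n" : "other\n");
}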
With 2 warps and float4 I get ~4.33 clocks average, so that is ~32 bits per bank per clock. With 4 warps and float2 I get ~2.09 clocks average, so that is also ~32 bits per bank per clock.
I certainly observe a change in behavior going from 2 warps to 1 warp. So I would say it also seems to need 2+ warps to get max throughput.
(L4 GPU, CUDA 12.2)
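To spell out the arithmetic: a warp-wide float4 access moves 32 × 16 B = 512 B, and the 32 banks can deliver 32 × 4 B = 128 B per clock, so the conflict-free minimum is 512 / 128 = 4 clocks; the measured ~4.33 is close to that. Likewise a warp-wide float2 access moves 32 × 8 B = 256 B, giving a 2-clock minimum vs. the measured ~2.09.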
All GPUs newer than Kepler will return cudaSharedMemBankSizeFourByte.
You can play around with the knobs to get slightly better measurements. For example with float2, 8 warps, and size of 16, I get ~2.03 clocks average.
I’ve modified the code to report the results in bytes/cycle and managed to get the code executed by the same thread a number of times (the loop was getting optimized out; __syncwarp() turned out to be the solution).
#include <cstdio>

using T = float;
extern __shared__ T bank[];

constexpr int warps = 32;
constexpr int pitch = 32 * warps;
constexpr int size = 16 / (sizeof(T) / sizeof(float)); // keep bytes per thread constant across T

__global__ void kernel()
{
    auto* local_bank = bank + threadIdx.x;
    auto* a = local_bank;

    __syncwarp();
    auto start = clock64();
    __syncwarp();

    int iterations = 10000;
    for (int iteration = 0; iteration < iterations; iteration++)
    {
        for (int i = 0; i < size; i++)
            a[i * pitch] = T{};
        __syncwarp(); // keeps the compiler from optimizing the stores away
    }

    __syncwarp();
    auto stop = clock64();
    __syncwarp();

    auto duration = (float)(stop - start) / (float)iterations;
    // bytes stored per cycle by one warp
    auto bytes_per_cycle = (float)(size * pitch * sizeof(T)) / duration / warps;
    printf("%7.1f %s", bytes_per_cycle, threadIdx.x % 32 == 31 ? "\n" : "");
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);

    kernel<<<1, pitch, size * pitch * sizeof(T)>>>();
    cudaDeviceSynchronize();
}
For float I get 2 bytes/cycle. I only get 4 bytes/cycle for float2 and float4, which is weird. Another weird thing: if I run the code below for float and float2 and time it with Nsight Compute, I get the same time in both cases: 16.57 ms. I would expect float2 to take 8 ms, since the example above showed that a block can reach 4 bytes per cycle, but there is no difference. @Robert_Crovella do you happen to know why?
using T = float;
extern __shared__ T bank[];

constexpr int warps = 32;
constexpr int pitch = 32 * warps;
constexpr int size = 16 / (sizeof(T) / sizeof(float));

__global__ void kernel()
{
    auto* local_bank = bank + threadIdx.x;
    auto* a = local_bank;

    int iterations = 1000;
    for (int iteration = 0; iteration < iterations; iteration++)
    {
        for (int i = 0; i < size; i++)
            a[i * pitch] = T{};
        __syncwarp();
    }
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);

    kernel<<<46 * 16, pitch, size * pitch * sizeof(T)>>>();
    cudaDeviceSynchronize();
}
If you compute the time this should take at 4 bytes/cycle you get 8 ms, not 16 ms. It seems the device falls back to 2 bytes/cycle with multiple blocks.
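Working that estimate out (assuming the RTX 2080's 46 SMs and a core clock near 1 GHz): each block needs 16 × 1024 × 4 B = 64 KiB of shared memory, so only one block fits per SM and the 46 × 16 blocks run in 16 waves. Per iteration a warp stores 16 × 32 × 4 B = 2 KiB; at 2 B/cycle/warp that is 1024 cycles, so 1000 iterations × 16 waves ≈ 16.4 M cycles ≈ 16 ms at 1 GHz, which matches the measurement. At 4 B/cycle/warp the same store count would finish in ≈ 8 ms.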
I did some additional experiments and ran across three more oddities:
If I double the pitch to “64 * warps” and reduce size to 16, I encounter another performance penalty of a factor of 2 (i.e., it takes 4 clock cycles to load/store 4 bytes instead of 2). Doubling the pitch should only use every other cell in each bank, but not generate any bank conflicts.
If I change local_bank to “bank + (threadIdx.x & ~1) + (threadIdx.x & 1) * pitch” (i.e., each thread has a bank conflict with one of its direct neighbors) and let the loop run for one less element (to avoid an invalid access), I still get 2 cycles. If I do the same for float2, I get 4 cycles! If I make it a four-way conflict (change “1” to “3” and skip the last 3 iterations) I get a bit less than twice these numbers (3.5/7.4; I did compensate for the fewer iterations in the calculation).
However, if I change local_bank to “bank + ((threadIdx.x >> 4) & 1) * pitch + (threadIdx.x & 15);” (i.e., each pair of threads at distance 16 within a warp should have a bank conflict) instead (and do one less iteration), I get 2 cycles for float AND float2. Making this a four-way conflict (shift by 3 instead of 4 and use 7 instead of 15 as a mask; skip last three iterations) I get 2 cycles (float) and 4 cycles (float2) again.
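For reference, the address computations from these three experiments written out as code (one variant at a time, copied from the descriptions above):

// 1: double the pitch, so only every other row of each bank is used
constexpr int pitch = 64 * warps;

// 2: 2-way conflict between direct neighbors (run the loop for one less element)
auto* local_bank = bank + (threadIdx.x & ~1) + (threadIdx.x & 1) * pitch;

// 3: 2-way conflict between threads at distance 16 within a warp
auto* local_bank = bank + ((threadIdx.x >> 4) & 1) * pitch + (threadIdx.x & 15);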
No, I don’t. I haven’t studied it much. The thread started out seeming to suggest that 32 bits per bank could not be achieved based on using clock64 measurement. From my perspective, we have now established that it can be, based on clock64 measurement.
TU10x, GA10x, and AD10x chips have an SM shared memory (and shuffle) throughput of 16 threads/cycle (vs. 32 threads/cycle on GV100, GA100, and GH100). The return bandwidth is 128 B/cycle. Reads of <= 32 bits achieve only 1/2 throughput; reads of 64 and 128 bits can achieve the full 128 B/cycle return.
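To illustrate that point (my own sketch, not code from this thread): widening the per-thread access to 64 or 128 bits is what lets these chips reach the full return bandwidth.

__global__ void wide_copy(float4* out)
{
    extern __shared__ float4 smem[];  // 128-bit elements
    // One 128-bit access per thread: the warp requests 512 B, which can be
    // returned in 4 clocks at 128 B/cycle; 32-bit accesses would run at
    // half that throughput on TU10x/GA10x/AD10x.
    out[threadIdx.x] = smem[threadIdx.x];
}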
Thank you very much for the clarification. Things make a lot more sense now. Maybe the corresponding entries in the programming guide could be updated accordingly (e.g., all devices of compute capability 8.6 have 16 threads/cycle according to your information).
Until then, since my project requires a throughput of 32 threads/cycle: what is the oldest/most basic (to be honest, most inexpensive, since I cannot afford an A100 for this project) chip that actually reaches 32 threads/cycle (according to the aforementioned table, all chips except compute capability 7.5…)?