Long Scoreboard Stalls On L1 Cache Hits

The Kernel

__global__ void sumRows(float *__restrict__ A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += A[row * numCols + col];
  y[row] = yy;
}

Each thread in a warp performs a reduction over its corresponding row of the matrix.
Since accesses to A are fully uncoalesced, each warp-wide load touches 32 sectors in 32 different cache lines (assuming numCols = 128). As long as those cache lines remain resident, subsequent loads within the same cache sector (8 elements per sector) should ideally hit in the L1 cache.
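
For concreteness, the back-of-the-envelope arithmetic I’m assuming here (4-byte floats, 32-byte sectors, 128-byte cache lines, numCols = 128):

lane i of a warp loads &A[(row0 + i) * numCols + col]
  -> adjacent lanes are numCols * 4 B = 512 B apart
  -> one warp-wide LDG touches 32 different 128-byte lines, one 32-byte sector each
within one thread's row, a 32-byte sector holds 32 B / 4 B = 8 consecutive floats
  -> per sector and per thread: 1 expected miss + 7 expected hits = 7/8 = 87.5 % L1 hit rate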

Expectation

I ran this kernel with numRows = 131072 and numCols = 128.

I would expect that long scoreboard stalls occur only at the first FADD, since the subsequent three loads should be L1 cache hits and therefore fast.
Surprisingly, the stalls remain significant.

I have confirmed that the L1 hit rate is 87.5%, i.e., 7/8, indicating that the cache lines are not being evicted.

Questions

  1. In the unrolled version of this loop, the elements loaded by successive instructions belong to the same cache sector.
    • If the first load misses and fetches the sector from DRAM, do the following loads also issue DRAM requests?
    • Or will they wait for the sector to arrive and then read from L1 cache without additional memory traffic?
  2. I also observed an interesting scaling pattern in kernel cycle count. When varying the number of rows:
    • On an RTX 3050 Laptop GPU (16 SMs × 128 CUDA cores), the total cycles increase in steps of 2048 rows.
    • On an RTX 4080 GPU (76 SMs), the step size becomes 9728 rows.
    • In contrast, a column-sum version of the kernel (where accesses are coalesced) shows smooth scaling with no such steps.

Why does this stepwise increase occur? It seems to align with the total number of CUDA cores, but I would appreciate any insight into why that correlation appears.

Environment

I’m using a laptop with an NVIDIA GeForce RTX 3050 Laptop GPU (Ampere architecture, compute capability 8.6).

The code is compiled with:

$ nvcc -arch=sm_86 -O3 report.cu -o report.out

Compiler and driver details:

$ nvcc --version                                            
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

$ gcc --version
gcc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
  • Driver version: 560.35.03
  • CUDA version: 12.6
  • OS: Ubuntu 22.04, kernel 6.8.0-85-generic

The report.cu file:

#include <cstdlib>
#include <cuda.h>
#include <iostream>
#include <memory>
#include <random>

#ifndef NUM_COLS
#define NUM_COLS 128
#endif
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 128
#endif

__global__ void sumRows(float *__restrict__ A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += A[row * numCols + col];
  y[row] = yy;
}

#define CC(call)                                                               \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (err != cudaSuccess) {                                                  \
      std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__     \
                << ": " << cudaGetErrorString(err) << std::endl;               \
      std::exit(EXIT_FAILURE);                                                 \
    }                                                                          \
  }

int main(int argc, const char *argv[]) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " numRows" << std::endl;
    std::exit(EXIT_FAILURE);
  }

  const auto numRows = std::atoi(argv[1]);
  const auto numCols = NUM_COLS;
  const auto blockSize = BLOCK_SIZE;
  const auto numBlocks = (numRows + blockSize - 1) / blockSize;

  auto h_A = std::make_unique<float[]>(numRows * numCols);
  auto h_y = std::make_unique<float[]>(numRows);

  int anyFixedSeed = 0;
  std::mt19937 rng(anyFixedSeed);
  std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);
  for (size_t i = 0; i < numRows * numCols; i++)
    h_A[i] = dist(rng);

  float *d_A, *d_y;
  CC(cudaMalloc(&d_y, sizeof(float) * numRows));
  CC(cudaMalloc(&d_A, sizeof(float) * numRows * numCols));
  CC(cudaMemcpy(d_A, h_A.get(), sizeof(float) * numRows * numCols,
                cudaMemcpyDefault));
  sumRows<<<numBlocks, blockSize>>>(d_A, d_y, numRows, numCols);
  CC(cudaGetLastError());

  CC(cudaMemcpy(h_y.get(), d_y, sizeof(float) * numRows, cudaMemcpyDefault));

  CC(cudaFree(d_A));
  CC(cudaFree(d_y));

  return 0;
}
  1. In the unrolled version of this loop, the elements loaded by successive instructions belong to the same cache sector.
  • If the first load misses and fetches the sector from DRAM, do the following loads also issue DRAM requests?
  • Or will they wait for the sector to arrive and then read from L1 cache without additional memory traffic?

The first load will report a cache-line tag-lookup miss and a sector miss. Only the missed sectors, not necessarily the whole cache line, will be fetched.

The second load to the same sector will report the following:

  • If the second load happens before the first load completes, it will be a tag hit and a sector hit, as the cache handles a warp’s requests pretty much in order. The request will be added to the pending request table and completed when the first request completes.
  • If the second load happens after the first load completes, then the answer depends on the eviction policy. If the line has been evicted, it is a tag miss and a sector miss. If the line is still resident, it is a tag hit and a sector hit.

A lot of the design of the L1 cache is due to historically in-order completion. With Volta the “enhanced” L1 cache supports out-of-order hits for a warp but in-order misses. Starting in GA10x or Ada, out-of-order misses between warps were added. Misses from a single warp still return in order.
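
Roughly, the per-line bookkeeping behind this behaves like the sketch below. This is a conceptual toy model only, not the actual hardware: fill return (which would flip a pending sector to valid), eviction, and multi-warp interaction are not modeled.

// Conceptual sketch: one 128-byte line holds four 32-byte sectors, misses are
// tracked per sector, and a load to a sector that is already pending counts as
// a hit and waits on the in-flight fill.
#include <array>
#include <cstdint>
#include <unordered_map>

struct LineState {
  std::array<bool, 4> sectorValid{};   // sector data present in L1
  std::array<bool, 4> sectorPending{}; // miss issued, fill not returned yet
};

enum class Result { TagMissSectorMiss, TagHitSectorMiss, TagHitSectorHit };

Result lookup(std::unordered_map<std::uint64_t, LineState> &cache,
              std::uint64_t addr) {
  const std::uint64_t tag = addr >> 7;     // 128 B line address
  const unsigned sector = (addr >> 5) & 3; // 32 B sector within the line
  auto it = cache.find(tag);
  if (it == cache.end()) { // first touch: allocate the line, fetch only this sector
    cache[tag].sectorPending[sector] = true;
    return Result::TagMissSectorMiss;
  }
  LineState &line = it->second;
  if (line.sectorValid[sector] || line.sectorPending[sector])
    return Result::TagHitSectorHit;  // hit, or queued behind the pending fill
  line.sectorPending[sector] = true; // resident line, different (missing) sector
  return Result::TagHitSectorMiss;
}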

  2. I also observed an interesting scaling pattern in kernel cycle count. When varying the number of rows:
  • On an RTX 3050 Laptop GPU (16 SMs × 128 CUDA cores), the total cycles increase in steps of 2048 rows.
  • On an RTX 4080 GPU (76 SMs), the step size becomes 9728 rows.
  • In contrast, a column-sum version of the kernel (where accesses are coalesced) shows smooth scaling with no such steps.

This is due to occupancy. The reproducer uses a blockDim of 128 threads.

  • 2048 rows = 16 SMs × 128 threads/block
  • 9728 rows = 76 SMs × 128 threads/block

As more warps are added to each SM there is more ability to hide latency. When the row count exceeds one wave of thread blocks there will be another major cliff, and the latency will increase linearly with the number of waves of thread blocks (see the sketch below).
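
In code form, the step and wave arithmetic works out roughly as below. This is only a sketch; the 1536 threads/SM figure is the CC 8.6/8.9 architectural maximum and the real resident-block count can be lower (registers, etc.).

// Rows per occupancy step and per full wave, under the assumptions above.
#include <cstdio>

int main() {
  const int numSMs = 16;            // RTX 3050 Laptop; use 76 for the RTX 4080
  const int blockSize = 128;        // threads per block == rows per block
  const int maxThreadsPerSM = 1536; // architectural limit for CC 8.6 / 8.9
  const int maxBlocksPerSM = maxThreadsPerSM / blockSize; // 12 with these numbers

  const int rowsPerStep = numSMs * blockSize;                  // 16 * 128 = 2048
  const int rowsPerWave = numSMs * maxBlocksPerSM * blockSize; // 24576 here

  std::printf("rows per step = %d\n", rowsPerStep); // one extra block per SM
  std::printf("rows per wave = %d\n", rowsPerWave); // one full wave of blocks
  return 0;
}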

I would expect that long scoreboard stalls occur only at the first FADD, since the subsequent three loads should be L1 cache hits and therefore fast.
Surprisingly, the stalls remain significant.

Given the lack of SASS documentation this is a reasonable expectation. I believe the first set of stalls you are seeing on the IMAD.X is due to a register dependency on R7. The LDG.E.CONSTANT Rdest, [R6.64 <+ offset>] instructions likely have a read scoreboard set on R6/R7, and the IMAD.X is updating R7.

PERFORMANCE OF STRIDED ACCESS PATTERN
The addressing method you are using seems like it would be efficient because the cache hit rate is high; however, it is a very inefficient method as numCols increases.

When you issue a warp to the LSU that has high address divergence, the instruction is replayed in the LSU pipe until all threads can be processed. With 32-way address divergence, each instruction occupies the LSU for up to 32 cycles vs. 1 cycle for a fully coalesced access. The LSU pipe is shared by all SM sub-partitions. With this high rate of divergence every access is 31x more expensive even if it hits in the L1 cache. Yes, you get a high hit rate and no additional L1-to-L2 or L2-to-DRAM traffic; however, NCU likely reports L1TEX as the limiter.

I expect the LDG instructions have a high lg_throttle or mio_throttle stall reason due to the back pressure from the LSU pipe.
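
For reference, one common way to get coalesced loads for a row sum without transposing the matrix is to assign a warp (rather than a thread) per row and reduce with shuffles. The kernel below is only a sketch of that alternative; it is not part of the reproducer in this thread and I have not benchmarked it here.

// Sketch: one warp per row, so lane k reads A[row * numCols + col + k] and
// consecutive lanes touch consecutive floats in the same 128-byte line
// (coalesced), instead of 32 different lines.
__global__ void sumRowsWarpPerRow(const float *__restrict__ A,
                                  float *__restrict__ y, size_t numRows,
                                  size_t numCols) {
  const int lane = threadIdx.x % 32;
  const size_t row = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) / 32;
  if (row >= numRows)
    return;

  float yy = 0.0f;
  for (size_t col = lane; col < numCols; col += 32)
    yy += A[row * numCols + col];

  // Tree reduction across the warp; lane 0 ends up with the full row sum.
  for (int offset = 16; offset > 0; offset /= 2)
    yy += __shfl_down_sync(0xffffffff, yy, offset);

  if (lane == 0)
    y[row] = yy;
}

It would be launched with 32 threads per row, e.g. sumRowsWarpPerRow<<<(numRows * 32 + 127) / 128, 128>>>(d_A, d_y, numRows, numCols).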


Thank you so much for the detailed explanation. That was extremely helpful!

Your discussion about the LSU replay and dependency on R7 clarified many of my initial misunderstandings.

I have a couple of follow-up questions to further confirm my understanding.

About the Long Scoreboard Stalls

From the disassembly, I can see that the compiler unrolled the loop by a factor of 4. All four global load instructions are scheduled toward the beginning, and the corresponding four FADDs that consume the loaded data appear later.

In the Nsight Compute results, the four FADDs show long scoreboard stalls of approximately 29.4%, 12.5%, 7.0%, and 4.4%, respectively.

Intuitively, I expected that only the first FADD would suffer from a long stall, since once the first load fetches the sector from memory, the following three loads should hit in L1 and complete almost immediately.
So I would have anticipated something more like 53%, 0.x%, 0.x%, 0.x%, assuming subsequent loads reuse the same cache sector or overlap in latency.

  • Could you help me understand why the later FADDs still experience significant long scoreboard stalls?

About the Stepwise Scaling in Kernel Cycle Count

I understand your explanation that the step size (e.g., 2048 rows on RTX 3050, 9728 on RTX 4080) corresponds to a “wave” of active thread blocks filling all SMs once.

However, I still have some doubts about why the step effect is much more visible in the uncoalesced sumRows case than in the coalesced sumCols (GlobalTransposed) version.

  • Could the stronger step pattern in sumRows be due to the LSU replay overhead — that is, since uncoalesced accesses occupy the LSU for more cycles per warp, fewer independent warps are able to hide latency effectively, so performance becomes more sensitive to occupancy “waves”?

Additionally, I tested allocating A in texture memory instead of global memory.
In that case, the step pattern almost disappears (though there’s still an interesting cliff near 2^14 + 2048 x 4 rows on the RTX 3050, and near 2^17 rows on the RTX 4080).
After that cliff, the step pattern for global memory also becomes less pronounced.

  • Would you have any insight into what might cause this cliff and the apparent “smoothing” of the steps afterward?

(Charts, profiled every 128 rows: total cycles vs. number of rows, and a zoomed-in view around the cliff.)

Thank you again for your time and the great insights you’ve already shared.
Your explanations have been invaluable for understanding the subtle performance behavior here!

Thank you for the well-articulated questions. It makes it much easier to provide a response.

About the Long Scoreboard Stalls

I may not be able to get back with a detailed answer based on running your test for a week, but this is my expectation.

  1. The compiler has unrolled the loop 4 times, resulting in 4 back-to-back LDGs; however, you have at least 3 other warps on the SM given the 128-thread block size. The likely execution order is:

warp 0 LDG 1
warp 1 LDG 1
warp 2 LDG 1
warp 3 LDG 1
warp 0 LDG 2
… and so on, in close to round-robin order. This doesn’t have to be the order, but I think it very likely. Due to the address divergence in each request, the LSU pipe is going to be stalled for 32 cycles, and the return from L2 will be 1 sector per cycle, so a single instruction will take a minimum of 32 cycles on fill. This will spread out the returns to the warp, resulting in long_scoreboard stalls on LDG 2, 3, and 4. The fact that warps 0, 1, 2, and 3 will be interleaved will increase the return latency between each LDG for a specific warp.
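
As a toy model of that spacing (it ignores the actual L2/DRAM latency, other resident warps, and real arbitration, and simply charges every divergent LDG 32 cycles of LSU/fill occupancy in the round-robin order above):

// Toy model: 4 warps issue 4 divergent LDGs each, round-robin; every LDG
// occupies the LSU/fill path for 32 cycles. Prints when each of warp 0's loads
// has its data back, i.e. when the matching FADD can finally issue.
#include <cstdio>

int main() {
  const int numWarps = 4, ldgsPerWarp = 4, cyclesPerLdg = 32;
  int cycle = 0;
  for (int ldg = 0; ldg < ldgsPerWarp; ++ldg)
    for (int warp = 0; warp < numWarps; ++warp) {
      cycle += cyclesPerLdg; // LSU/fill busy replaying this divergent LDG
      if (warp == 0)
        std::printf("warp 0 LDG %d data ready at cycle %d\n", ldg + 1, cycle);
    }
  return 0;
}

In this model warp 0’s four loads complete at cycles 32, 160, 288, and 416, so the later FADDs still see substantial long_scoreboard waits even though their data nominally hits in L1.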

About the Stepwise Scaling in Kernel Cycle Count

I understand your explanation that the step size (e.g., 2048 rows on RTX 3050, 9728 on RTX 4080) corresponds to a “wave” of active thread blocks filling all SMs once.

However, I still have some doubts about why the step effect is much more visible in the uncoalesced sumRows case than in the coalesced sumCols (GlobalTransposed) version.

This is also a good question. I think the primary issue is that your memory requests are so inefficient in the uncoalesced case, both in the L1TEX tag-and-miss stage and in the fill-return stage. With the fully coalesced access, a single instruction takes 1 cycle through the L1TEX tag-and-miss stage, and the L2 returns 4 × 32 B per request, filling all 4 sectors of the line on return. I think this would be even more obvious on a 100-class part, where each SM can receive L2 fill at a higher rate.

In order to provide more information on your nice charts I would need to review the kernel in more detail, think through various limits, and look at some profiling reports. I think Global is either latency bound or LSU bound. I think GlobalTransposed (assuming this is coalesced) and texture are more likely memory bound, so there is closer to a linear relationship between rows and cycles.

When it goes off the cliff I expect the grid is reaching multiple waves, or a tail effect is starting. If you run this through NCU it would be interesting to look at the PM sampling (periodic sampling of PM counters) for sm__warps_active (or warp occupancy), L1TEX metrics, L2 metrics, and DRAM metrics, especially at the end of the kernel. sm__warps_active should be able to clearly show the steps.


Due to the address divergence in each request, the LSU pipe is going to be stalled for 32 cycles, and the return from L2 will be 1 sector per cycle, so a single instruction will take a minimum of 32 cycles on fill. This will spread out the returns to the warp, resulting in long_scoreboard stalls on LDG 2, 3, and 4. The fact that warps 0, 1, 2, and 3 will be interleaved will increase the return latency between each LDG for a specific warp.

I believe this is the answer to this topic. Many thanks!

When it goes off the cliff I expect the grid is reaching multiple waves, or a tail effect is starting. If you run this through NCU it would be interesting to look at the PM sampling (periodic sampling of PM counters) for sm__warps_active (or warp occupancy), L1TEX metrics, L2 metrics, and DRAM metrics, especially at the end of the kernel. sm__warps_active should be able to clearly show the steps.

Exactly, that makes perfect sense! The number of active warps lining up with those cliffs really helps clarify the pattern I was seeing.

I think Global is either latency bound or LSU bound. I think GlobalTransposed (assuming this is coalesced) and texture are more likely memory bound, so there is closer to a linear relationship between rows and cycles.

Your explanation of the difference between latency-bound (or LSU-bound) and memory-bound behavior also sounds very reasonable. Although it’s quite tricky to collect clear profiling evidence for that kind of behavior, the reasoning fits well with the observations!

Thank you for the insights.

For reference, here’s the reproducible showing all three implementations side by side:

#include <cstdlib>
#include <cstring>
#include <cuda.h>
#include <iostream>
#include <memory>
#include <random>
#include <type_traits>

#ifndef NUM_COLS
#define NUM_COLS 128
#endif
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 128
#endif

/// Only support T = float; this is only to mimic the API of tex2D.
template <typename T = float>
__device__ __forceinline__ T __tex2Ds32(cudaTextureObject_t tex, const int x,
                                        const int y) {
  static_assert(std::is_same<T, float>::value, "Only support float type");
  T a, b, c, d;
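  // The v4 fetch always returns 4 f32 components; only the first one (a) is used.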
  asm volatile("tex.2d.v4.f32.s32 {%0, %1, %2, %3}, [%4, {%5, %6}];\n\t"
               : "=f"(a), "=f"(b), "=f"(c), "=f"(d)
               : "l"(tex), "r"(x), "r"(y));
  return a;
}

__global__ void sumRows(float *__restrict__ A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += A[row * numCols + col];
  y[row] = yy;
}

struct TransposedTag {};

__global__ void sumRows(TransposedTag, float *__restrict__ A,
                        float *__restrict__ y, size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += A[col * numRows + row];
  y[row] = yy;
}

__global__ void sumRows(cudaTextureObject_t A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += __tex2Ds32<float>(A, col, row);
  y[row] = yy;
}

/// A texture object wrapper that contains all the necessary information to use
/// a texture in CUDA.
struct __TextureObject {
  /// \note An alias for `resDesc.res.array.array`.
  cudaArray_t devPtr;
  cudaChannelFormatDesc fmtDesc;
  cudaResourceDesc resDesc;
  cudaTextureDesc texDesc;
  /// The actual CUDA texture object.
  cudaTextureObject_t tex;
};

cudaError_t __malloc2DTextureObject(__TextureObject *texObj, size_t width,
                                    size_t height) {
  cudaError_t err;

  std::memset(texObj, 0, sizeof(__TextureObject));

  texObj->fmtDesc = cudaCreateChannelDesc<float>();
  err = cudaMallocArray(&texObj->devPtr, &texObj->fmtDesc, width, height);
  if (err != cudaSuccess) {
    return err;
  }

  texObj->resDesc.resType = cudaResourceTypeArray;
  texObj->resDesc.res.array.array = texObj->devPtr;

  // These are the default values for the texture descriptor; can omit them.
  texObj->texDesc.filterMode = cudaFilterModePoint;
  texObj->texDesc.addressMode[0] = cudaAddressModeWrap;
  texObj->texDesc.addressMode[1] = cudaAddressModeWrap;
  texObj->texDesc.readMode = cudaReadModeElementType;
  texObj->texDesc.normalizedCoords = 0;

  err = cudaCreateTextureObject(&texObj->tex, &texObj->resDesc,
                                &texObj->texDesc, nullptr);
  if (err != cudaSuccess) {
    cudaFreeArray(texObj->devPtr);
    return err;
  }

  return cudaSuccess;
}

cudaError_t __memcpy2DTextureObject(__TextureObject *texObj, void *data,
                                    size_t width, size_t height) {
  // The width in memory in bytes of the 2D array pointed to by devPtr,
  // including padding. We don't have any padding.
  return cudaMemcpy2DToArray(texObj->devPtr, 0, 0, data, width * sizeof(float),
                             width * sizeof(float), height,
                             cudaMemcpyHostToDevice);
}

cudaError_t __freeTextureObject(__TextureObject *texObj) {
  cudaError_t err;

  err = cudaDestroyTextureObject(texObj->tex);
  if (err != cudaSuccess) {
    return err;
  }
  err = cudaFreeArray(texObj->devPtr);
  if (err != cudaSuccess) {
    return err;
  }

  return cudaSuccess;
}

#define CC(call)                                                               \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (err != cudaSuccess) {                                                  \
      std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__     \
                << ": " << cudaGetErrorString(err) << std::endl;               \
      std::exit(EXIT_FAILURE);                                                 \
    }                                                                          \
  }

enum Mode {
  kGlobalMem = 0,
  kGlobalTransposed = 1,
  kTextureMem = 2,
};

int main(int argc, const char *argv[]) {
  if (argc != 3) {
    std::cerr << "usage: " << argv[0] << " numRows mode" << std::endl;
    std::exit(EXIT_FAILURE);
  }

  const auto numRows = std::atoi(argv[1]);
  const auto numCols = NUM_COLS;
  const auto mode = static_cast<Mode>(std::atoi(argv[2]));
  const auto blockSize = BLOCK_SIZE;
  const auto numBlocks = (numRows + blockSize - 1) / blockSize;

  auto h_A = std::make_unique<float[]>(numRows * numCols);
  auto h_y = std::make_unique<float[]>(numRows);

  int anyFixedSeed = 0;
  std::mt19937 rng(anyFixedSeed);
  std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);
  for (size_t i = 0; i < numRows * numCols; i++)
    h_A[i] = dist(rng);

  float *d_y;
  CC(cudaMalloc(&d_y, sizeof(float) * numRows));

  switch (mode) {
  case kGlobalMem: {
    float *d_A;
    CC(cudaMalloc(&d_A, sizeof(float) * numRows * numCols));
    CC(cudaMemcpy(d_A, h_A.get(), sizeof(float) * numRows * numCols,
                  cudaMemcpyDefault));
    sumRows<<<numBlocks, blockSize>>>(d_A, d_y, numRows, numCols);
    CC(cudaGetLastError());
    CC(cudaFree(d_A));
  } break;
  case kGlobalTransposed: {
    /* Transpose matrix on the host. */
    auto h_AT = std::make_unique<float[]>(numRows * numCols);
    for (size_t r = 0; r < numRows; r++)
      for (size_t c = 0; c < numCols; c++)
        h_AT[c * numRows + r] = h_A[r * numCols + c];
    float *d_AT;
    CC(cudaMalloc(&d_AT, sizeof(float) * numRows * numCols));
    CC(cudaMemcpy(d_AT, h_AT.get(), sizeof(float) * numRows * numCols,
                  cudaMemcpyDefault));
    sumRows<<<numBlocks, blockSize>>>(TransposedTag{}, d_AT, d_y, numRows,
                                      numCols);
    CC(cudaGetLastError());
    CC(cudaFree(d_AT));
  } break;
  case kTextureMem: {
    __TextureObject d_A;
    CC(__malloc2DTextureObject(&d_A, numCols, numRows));
    CC(__memcpy2DTextureObject(&d_A, h_A.get(), numCols, numRows));
    sumRows<<<numBlocks, blockSize>>>(d_A.tex, d_y, numRows, numCols);
    CC(cudaGetLastError());
    CC(__freeTextureObject(&d_A));
  } break;
  }

  CC(cudaMemcpy(h_y.get(), d_y, sizeof(float) * numRows, cudaMemcpyDefault));

  CC(cudaFree(d_y));

  return 0;
}

It’s also very surprising to me that texture memory gets that close to the coalesced implementation. As far as I know, 2D texture memory places vertically adjacent elements (in particular, a small 2 × 16-byte tile) within the same sector when fetched into the L1 cache (or, more precisely, the texture cache), so the number of requested sectors is reduced from 32 to 16 compared with the uncoalesced version. However, the reduction in cycle count is so significant that it’s very similar to the coalesced case. May I ask why texture memory behaves so differently from global memory in this specific case?