The Kernel
__global__ void sumRows(float *__restrict__ A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if ((size_t)row >= numRows) return;  // grid is rounded up to whole blocks
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += A[row * numCols + col];
  y[row] = yy;
}
Each thread reduces over one row of the matrix, so the 32 threads of a warp handle 32 consecutive rows.
Since accesses to A are fully uncoalesced, each warp-wide load touches 32 sectors in 32 different 128-byte cache lines (with numCols = 128, the rows read by consecutive threads start 512 bytes apart). As long as those cache lines remain resident, the subsequent loads within the same 32-byte sector (8 floats per sector) should hit in L1.
Expectation
I ran this kernel with numRows = 131072 and numCols = 128.
I would expect long scoreboard stalls only at the FADD that consumes the first load of each sector, since the subsequent seven loads should be L1 cache hits and therefore fast.
Surprisingly, the stalls remain significant.
I have confirmed that the L1 hit rate is 87.5%, i.e., 7/8, indicating that the fetched sectors are not being evicted between consecutive loads.
Questions
- In the unrolled version of this loop, the elements loaded by successive instructions belong to the same cache sector.
- If the first load misses and fetches the sector from DRAM, do the following loads also issue DRAM requests?
- Or will they wait for the sector to arrive and then read from L1 cache without additional memory traffic?
- I also observed an interesting scaling pattern in the kernel's cycle count when varying the number of rows:
- On an RTX 3050 Laptop GPU (16 SMs × 128 CUDA cores = 2048), the total cycle count increases in steps of 2048 rows.
- On an RTX 4080 GPU (76 SMs × 128 CUDA cores = 9728), the step size becomes 9728 rows.
- In contrast, a column-sum version of the kernel (where accesses are coalesced) shows smooth scaling with no such steps.
Why does this stepwise increase occur? It seems to align with the total number of CUDA cores, but I would appreciate any insight into why that correlation appears.
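For reference, the column-sum variant mentioned above is not part of the report file; a minimal sketch of such a coalesced kernel (one thread per column, so each warp reads 32 consecutive floats from one row per iteration; the name sumCols and the guard are my additions, not code from the report) could look like:

```cuda
// Hypothetical coalesced counterpart to sumRows: thread j sums column j.
// In each loop iteration a warp reads 32 consecutive floats of one matrix
// row, i.e. a single 128-byte cache line per warp-wide load.
__global__ void sumCols(const float *__restrict__ A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  size_t col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= numCols) return;
  float yy = 0;
  for (size_t row = 0; row < numRows; row++)
    yy += A[row * numCols + col];
  y[col] = yy;
}
```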
Environment
I’m using a laptop with an NVIDIA GeForce RTX 3050 Laptop GPU (Ampere architecture, compute capability 8.6).
The code is compiled with:
$ nvcc -arch=sm_86 -O3 report.cu -o report.out
Compiler and driver details:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
$ gcc --version
gcc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
- Driver version: 560.35.03
- CUDA version: 12.6
- OS: Ubuntu 22.04, kernel 6.8.0-85-generic
The report.cu file:
#include <cstdlib>
#include <cuda.h>
#include <iostream>
#include <memory>
#include <random>

#ifndef NUM_COLS
#define NUM_COLS 128
#endif

#ifndef BLOCK_SIZE
#define BLOCK_SIZE 128
#endif

__global__ void sumRows(float *__restrict__ A, float *__restrict__ y,
                        size_t numRows, size_t numCols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if ((size_t)row >= numRows) return;  // grid is rounded up to whole blocks
  float yy = 0;
  for (size_t col = 0; col < numCols; col++)
    yy += A[row * numCols + col];
  y[row] = yy;
}

// Abort with a readable message on any CUDA API error.
#define CC(call)                                                               \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (err != cudaSuccess) {                                                  \
      std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__     \
                << ": " << cudaGetErrorString(err) << std::endl;               \
      std::exit(EXIT_FAILURE);                                                 \
    }                                                                          \
  }

int main(int argc, const char *argv[]) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " numRows" << std::endl;
    std::exit(EXIT_FAILURE);
  }
  // size_t avoids int overflow in numRows * numCols for large inputs.
  const size_t numRows = std::atoll(argv[1]);
  const size_t numCols = NUM_COLS;
  const size_t blockSize = BLOCK_SIZE;
  const size_t numBlocks = (numRows + blockSize - 1) / blockSize;

  auto h_A = std::make_unique<float[]>(numRows * numCols);
  auto h_y = std::make_unique<float[]>(numRows);

  // Fill A with reproducible random data.
  int anyFixedSeed = 0;
  std::mt19937 rng(anyFixedSeed);
  std::uniform_real_distribution<float> dist(-1000.0f, 1000.0f);
  for (size_t i = 0; i < numRows * numCols; i++)
    h_A[i] = dist(rng);

  float *d_A, *d_y;
  CC(cudaMalloc(&d_y, sizeof(float) * numRows));
  CC(cudaMalloc(&d_A, sizeof(float) * numRows * numCols));
  CC(cudaMemcpy(d_A, h_A.get(), sizeof(float) * numRows * numCols,
                cudaMemcpyDefault));

  sumRows<<<numBlocks, blockSize>>>(d_A, d_y, numRows, numCols);
  CC(cudaGetLastError());
  // This device-to-host copy also synchronizes on kernel completion.
  CC(cudaMemcpy(h_y.get(), d_y, sizeof(float) * numRows, cudaMemcpyDefault));

  CC(cudaFree(d_A));
  CC(cudaFree(d_y));
  return 0;
}