Strided thread access with data reuse in cache lines

Hi,

I created some test code to better understand how memory coalescing and global memory access patterns affect bandwidth and, ultimately, performance. From what I understand of this section of the guide:

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory

Strided access leads to poor bandwidth utilization because the hardware fetches data in 32-byte chunks (on compute capability 6.0 and higher), and the threads don't use all of that data, so bandwidth is wasted.
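For concreteness, here is a minimal sketch of the two warp access patterns I have in mind (these toy kernels are illustrative only and are not the test code below):

// Unit stride: consecutive threads read consecutive floats, so a warp's
// 32 x 4-byte loads fall into four contiguous 32-byte sectors.
__global__ void read_coalesced(const float *in, float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Stride of 4 elements: a warp's loads spread over sixteen 32-byte sectors,
// and most of each fetched sector is unused by that instruction.
__global__ void read_strided(const float *in, float *out, int n)
{
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (i < n) out[i] = in[i];
}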

My question is: how does this “unused” data propagate to the caches, if at all? From several posts, such as this one, it appears that the L1 and L2 caches have 128-byte tags and that the lines are 128 bytes made up of four 32-byte sectors. So, in the case of strided access, do unused elements in the 32-byte chunks of global memory make it into these lines?

In the simple add routine below, the threads of a warp access data THREAD_STRIDE elements apart. However, the second loop (the t-loop) then walks through the data in the gaps. My intuition was that the threads would cause an uncoalesced load when they execute the i-loop, but if the full 32-byte sectors from global memory made it into the L1 and L2 caches, then the t-loop should not lead to additional global transactions, since the data should now be found in the caches. Is this correct?

#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>
#include <string>
#include <cassert>
#include "helper_cuda.h"
#include <vector>

template<int THREAD_STRIDE>
__global__ void add(int n, float *x, float *y)
{
  // Each thread starts THREAD_STRIDE elements apart; the inner t-loop then
  // walks the THREAD_STRIDE consecutive elements handled by this thread.
  int index = (blockIdx.x * blockDim.x + threadIdx.x) * THREAD_STRIDE;
  int block_stride = THREAD_STRIDE * blockDim.x * gridDim.x;

  // Grid-stride outer loop over this thread's chunks.
  for (int i = index; i < n; i += block_stride) {
    for (int t = i; t < i + THREAD_STRIDE; t++) {
      y[t] = x[t] + y[t];
    }
  }
}

// Reset the device arrays to known values (x = 1, y = 2) before each launch.
void init(int N, float *x, float *y) {

  std::vector<float> xh(N, 1.0f);
  std::vector<float> yh(N, 2.0f);
  
  checkCudaErrors(cudaMemcpy(x, &xh[0], N*sizeof(float), cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(y, &yh[0], N*sizeof(float), cudaMemcpyHostToDevice));
}

int main(int argc, char** argv)
{
  constexpr int N_STRIDES = 4;
  constexpr int N_BLOCK_SIZE = 32;
  int N = N_STRIDES*1024*1024;
  float *x, *y;

  checkCudaErrors(cudaMalloc(&x, N*sizeof(float)));
  checkCudaErrors(cudaMalloc(&y, N*sizeof(float)));

  int num_thread_blocks = N/N_BLOCK_SIZE/N_STRIDES;

  // Run each variant 100 times; nvprof reports min/max/avg over invocations.
  for (int i = 0; i < 100; i++) {
    // Unit-stride version: consecutive threads touch consecutive elements.
    init(N, x, y);
    add<1><<<num_thread_blocks, N_BLOCK_SIZE>>>(N, x, y);
    checkCudaErrors( cudaPeekAtLastError() );

    // Strided version: each thread walks N_STRIDES consecutive elements.
    init(N, x, y);
    add<N_STRIDES><<<num_thread_blocks, N_BLOCK_SIZE>>>(N, x, y);
    checkCudaErrors( cudaPeekAtLastError() );
  }
 

  // Free memory
  cudaFree(x);
  cudaFree(y);

  cudaProfilerStop();
  cudaDeviceReset();

  return 0;
}
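
For reference, I build and profile it roughly like this (the include path for helper_cuda.h and the file name are just placeholders for my setup):

nvcc -arch=sm_70 -I/path/to/cuda-samples/Common stride_test.cu -o stride_test
nvprof --metrics gld_transactions_per_request,gld_transactions,global_load_requests,gst_transactions,global_store_requests,dram_read_transactions ./stride_test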

In add<1>, the threads all read in a coalesced manner; the i-loop is executed 4 times per thread and the t-loop once per i-iteration. In add<4>, the threads read in a strided manner; the i-loop is executed once but the t-loop 4 times. Here is what nvprof tells me. I can understand the number of transactions per global request being 4 times higher for the strided add. However, I don't quite understand why the number of load and store SM requests is identical in the two cases. I would have expected the strided version to have a quarter of the requests, since in the i-loop we read through the same chunks of memory but loaded 4 times as much, and in the t-loop we should have found that data in the cache. I also find it odd that, with caching, the number of global transactions is 4 times higher for add<4>.

Invocations                               Metric Name                                          Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: void add<int=1>(int, float*, float*)
        100              gld_transactions_per_request                        Global Load Transactions Per Request    4.000000    4.000000    4.000000
        100                  shared_load_transactions                                    Shared Load Transactions           0           0           0
        100                          gld_transactions                                    Global Load Transactions     1048576     1048576     1048576
        100                    dram_read_transactions                             Device Memory Read Transactions     1048295     1048651     1048493
        100                          dram_utilization                                   Device Memory Utilization    High (7)    High (7)    High (7)
        100                      dram_read_throughput                               Device Memory Read Throughput  362.78GB/s  384.70GB/s  375.35GB/s
        100                      l2_global_load_bytes        Bytes read from L2 for misses in L1 for global loads    33554432    33554432    33554432
        100                      global_load_requests    Total number of global load requests from Multiprocessor      262144      262144      262144
        100               l2_local_global_store_bytes    Bytes written to L2 from L1 for local and global stores.    16777216    16777216    16777216
        100                     global_store_requests   Total number of global store requests from Multiprocessor      131072      131072      131072
        100                          gst_transactions                                   Global Store Transactions      524288      524288      524288
    Kernel: void add<int=4>(int, float*, float*)
        100              gld_transactions_per_request                        Global Load Transactions Per Request   15.999191   15.999832   15.999587
        100                  shared_load_transactions                                    Shared Load Transactions           0           0           0
        100                          gld_transactions                                    Global Load Transactions     4194092     4194260     4194195
        100                    dram_read_transactions                             Device Memory Read Transactions     1048583     1048727     1048610
        100                          dram_utilization                                   Device Memory Utilization    High (8)    High (8)    High (8)
        100                      dram_read_throughput                               Device Memory Read Throughput  431.41GB/s  438.95GB/s  435.44GB/s
        100                      l2_global_load_bytes        Bytes read from L2 for misses in L1 for global loads    39401984    40883584    40109300
        100                      global_load_requests    Total number of global load requests from Multiprocessor      262144      262144      262144
        100               l2_local_global_store_bytes    Bytes written to L2 from L1 for local and global stores.    67108864    67108864    67108864
        100                     global_store_requests   Total number of global store requests from Multiprocessor      131072      131072      131072
        100                          gst_transactions                                   Global Store Transactions     2097152     2097152     2097152

In CUPTI for GV100, the terms are:

  • requests equals instructions executed.
  • transactions equals L1 (32B) or L2 (32B) sector accesses.

The two examples execute the same number of GLD and GST instructions.
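
For what it's worth, the request counts in the table follow directly from the launch configuration. Here is a quick host-side sketch of the arithmetic (the constants mirror the posted code; nothing here comes from the profiler):

// Host-only sketch: derive the expected warp-level request counts from the
// launch configuration above (N = 4*1024*1024, 32 threads per block,
// N_STRIDES = 4). Both kernels execute the same instruction counts.
#include <cstdio>

int main() {
  const long long N     = 4LL * 1024 * 1024;  // elements per array
  const long long warps = N / 32 / 4;         // 32768 blocks, one warp each
  const long long iters = 4;                  // total i x t trips per thread in both kernels
  // Each trip issues 2 global loads (x[t], y[t]) and 1 global store (y[t]).
  std::printf("load requests:  %lld\n", warps * iters * 2);  // 262144, as in the table
  std::printf("store requests: %lld\n", warps * iters * 1);  // 131072, as in the table
  return 0;
}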

The GV100 LSU pipe (global/local) can resolve address divergence for all 32 threads in the warp in one cycle. The limitation is that the addresses must fall within at most 4 cache lines.

The metric gld_transactions_per_request is the number of unique L1 sectors accessed (transactions) divided by the number of global load instructions executed (request).
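
Plugging the counters from the table into that definition reproduces the reported ratios:

  add<1>: 1048576 gld_transactions / 262144 requests = 4 sectors per request
          (a unit-stride warp load touches 32 x 4 B contiguous = 128 B = 4 sectors)
  add<4>: 4194195 gld_transactions / 262144 requests ≈ 16 sectors per request
          (a stride-4 warp load spreads 32 x 4 B over 512 B = 16 sectors)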

Thanks @Greg. I'm still having some trouble forming a mental picture of what is happening. If both examples execute the same number of global load and store instructions, I would expect add<4> and add<1> to have an identical number of unique L1 transactions.

  • In add<1>, a thread block with 32 threads processes four 128-byte chunks (per input array) in the kernel. Because the access is unit stride, I would expect each 128-byte access request to load four 32-byte sectors in a coalesced fashion (4 transactions per request). We process four 128-byte chunks (the i-loop), so a single thread block leads to 16 gld transactions and 4 requests from the SM. Is this correct?

  • In add<4>, a thread block with 32 threads processes a single 512-byte chunk (per input array). When the first t-loop iteration executes, a 128-byte access with a stride of 4 elements should trigger sixteen 32-byte sectors to be loaded into 4 L1 cache lines. However, accessing the remaining 384 bytes should not trigger any further new L1 sector loads; all of that data should be found in the same 4 cache lines we pulled in on the first iteration. Therefore, having processed 512 bytes, we still only loaded 16 unique sectors, the same as add<1> (see the host-side sketch after this list).
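
To make that mental model concrete, here is a small host-side sketch that replays the kernel's index arithmetic for one thread block and counts the unique 32-byte sectors it touches in a single array (it only counts addresses and says nothing about how the profiler counts transactions):

#include <cstdio>
#include <set>

// Replay the kernel's index arithmetic for one block and count the unique
// 32-byte sectors touched in one float array.
long unique_sectors_for_block(int stride, int block, int blockDim, int gridDim, int n) {
  std::set<long> sectors;
  long block_stride = (long)stride * blockDim * gridDim;
  for (int tid = 0; tid < blockDim; tid++) {
    long index = ((long)block * blockDim + tid) * stride;
    for (long i = index; i < n; i += block_stride)
      for (long t = i; t < i + stride; t++)
        sectors.insert(t * 4 / 32);            // 4-byte floats, 32-byte sectors
  }
  return (long)sectors.size();
}

int main() {
  const int N = 4 * 1024 * 1024;
  const int blocks = N / 32 / 4;               // same launch config as the test code
  std::printf("add<1>: %ld unique sectors per block\n",
              unique_sectors_for_block(1, 0, 32, blocks, N));  // prints 16
  std::printf("add<4>: %ld unique sectors per block\n",
              unique_sectors_for_block(4, 0, 32, blocks, N));  // prints 16
  return 0;
}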

However, gld_transactions_per_request is clearly 4 times higher for add<4> than for add<1>, which contradicts my mental model, given that the number of GLD and GST instructions is identical.

add<1>

  • Each load reads four 32B sectors from the L1.
  • All loads miss the L1.

add<4>

  • Each load reads sixteen 32B sectors from the L1.
  • The 2nd-4th iterations of t will likely hit in the L1. This still counts as a read…

gld_transactions is the number of sector reads from the L1, not from the L2.
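
Checking that against the table above:

  add<4>: 262144 load requests x 16 sectors each = 4194304, essentially the
          measured gld_transactions (~4194195), even though most of those
          sector accesses are L1 hits.
  dram_read_transactions stays at ~1048576 for both kernels, i.e. two 16 MiB
          arrays / 32 B per transaction: each byte is still read from device
          memory only once.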

The conditions and increments on the i and t loops are not straightforward. I would highly recommend simplifying them to make the code easier to read.

Thanks for the clarification.

“2nd - 4th iteration of t will likely hit in the L1. This still counts as a read…”

Does your comment then imply that even though the 2nd-4th iterations of t in add<4> read sectors already loaded into L1 on the first iteration, they still count towards gld_transactions_per_request and are classified as unique reads? I ask because of the definition you shared earlier:

The metric gld_transactions_per_request is the number of unique L1 sectors accessed (transactions) divided by the number of global load instructions executed (request).

I agree that the conditions in the loops are not straightforward, but I have not come across a case like this discussed on the forum or elsewhere. This is a watered-down example of the real problem I am trying to solve. Thanks for your patience!