Attempting to access memory only once per warp (by the 0th thread out of 32) still causes a memory access by all the threads

I tried to create a minimalistic reproduction of the behavior I am seeing. Consider the code given below:

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    // Calculate the index for this thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    float d = 10.0;
    // Perform the vector addition if within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx] + d;
    }
}

__global__ void vectorAdd1(const float *a, const float *b, float *c, float *data, int n) {
    // Calculate the index for this thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    float d = data[0];
    // Perform the vector addition if within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx] + d;
    }
}

__global__ void vectorAdd2(const float *a, const float *b, float *c, float *data, int n) {
    // Calculate the index for this thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    float d = 10.0; 
    
    if (threadIdx.x % warpSize == 0) {
        d = data[0];
    }
    // Perform the vector addition if within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx] + d;
    }
}

int main() {
    int n = 1<<19; // Size of vectors
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    float *h_data = (float*)malloc(sizeof(float));

    // Initialize input vectors on host
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }
    h_data[0] = 10.0;

    // Allocate device memory for vectors
    float *d_a, *d_b, *d_c, *d_data;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMalloc(&d_data, sizeof(float));

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, h_data, sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel with indirect pointers
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    vectorAdd1<<<blocks, threads>>>(d_a, d_b, d_c, d_data, n);
    vectorAdd2<<<blocks, threads>>>(d_a, d_b, d_c, d_data, n);
    
    // Copy result back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Free memory
    free(h_a);
    free(h_b);
    free(h_c);
    free(h_data);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_data);
    return 0;
}

The first kernel is a basic vectorAdd kernel where I add two vectors a and b element-wise and store the result in a third vector c. So, per thread, we access global memory twice for reads and once for a write.


From the ncu timing diagram, there are 32.77 K read requests to global memory (I guess 32.77/2 = 16.38 K requests each for the reads of a and b).


In the second kernel, vectorAdd1, we additionally read the content of a one-element vector in every thread of the kernel. So, based on the previous logic, there are now 3 * 16.38 K (= 49.15 K) read requests to global memory (one set for the data vector and one each for vectors a and b).


The third kernel, vectorAdd2, tries to access data only from the first thread in each warp, so I supposed that for the access of the data vector alone there should not be 16.38 K requests but rather 16.38 K/32, given that the memory is accessed once for every 32 threads. But contrary to my assumption, I find that the memory requests made are the same as for vectorAdd1.

Why is it happening like this?

Some things to note:

  1. On a GPU, (or most modern CPUs) you don’t get to ask for just one byte from DRAM, or just 4 bytes from DRAM. When you ask for data from DRAM, the minimum request size on a modern GPU is 32 bytes.
  2. An instruction issued to a single thread in a warp is issued to all threads in the warp. Even when the conditional behavior of the source code suggests otherwise. We can get into the details of how that is handled with active masking, predication, etc. but the fundamental fact is that instructions in a CUDA GPU are all issued warp wide. Every instruction. Always.
  3. When multiple threads in a warp read the same location in global or shared memory, that item is only requested once, warp-wide, and broadcast to all threads in the warp that need it.

There is no difference between your vectorAdd1 and vectorAdd2 kernels in terms of how the data for d is accessed.

Sessions 1-4 of this online training series may help with certain details/aspects of GPU instruction issue behavior, particularly session 3.


To understand this, let me do some math to tally the numbers in ncu.

Based on that: since vectors a and b are of type float (4 bytes), and since the elements are accessed one after the other, a memory request for an element of, say, a would bring in 128 bytes of data (the L1 cache line size), i.e. 32 consecutive elements of a, which serves 32 threads (a warp). Similarly for b. So a warp of 32 threads effectively makes 2 memory read requests in the vectorAdd case. With 2048 * 256 threads in total, i.e. 16384 (= 2048 * 256 / 32) warps, that makes 16384 * 2 = 32768 read requests to global memory. Each 128-byte request is composed of 4 sectors (the 32-byte data chunks mentioned above).


32768 memory requests of 128 bytes each account for the 4194304 bytes of load.
Since the 32768 memory requests are of 128 bytes (4 sectors) each, the total number of sectors = 32768 * 4 = 131072.


Now let us move on to the case of vectorAdd1.

Based on that, I suppose that for data[0] a memory request of one sector (i.e. 32 bytes, the minimum, as mentioned) is made per warp, in addition to what is already requested in vectorAdd. So in this case 3 memory requests are made per warp, and 16384 warps make 49152 read requests.
Reads of a and b are handled by 2 memory requests of 4 sectors each per warp, and the read of data is handled by 1 memory request of 1 sector per warp. So we make a total of (2 * 4 + 1 =) 9 sector requests per warp, i.e. 16384 * 9 = 147456 sectors in total, i.e. (147456 * 32 =) 4718592 bytes of data. [9 sectors through 3 memory requests make 3 sectors/request.]

Along the same lines, not exactly related, but I wanted to understand something a bit more. Consider the code below:

#include <stdlib.h>
#include <cuda_runtime.h>
__global__ void vectorAdd3_1(const float *a, const float *b, float *c, double *data, int n) {
    // shared memory for data
    __shared__ double d;
    
    // Calculate the index for this thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
     
    // Only the first thread in the block loads the data to shared memory
    if (threadIdx.x == 0) {
        d = data[0];
    }
    // Synchronize to ensure shared memory is initialized
    __syncthreads();

    // Perform the vector addition if within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx] + (float)d;
    }
}

__global__ void vectorAdd3_2(const float *a, const float *b, float *c, double *data, double* data1, int n) {
    // shared memory for data
    __shared__ double d;
    __shared__ double d1;
    
    // Calculate the index for this thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
     
    // Only the first thread in the block loads the data to shared memory
    if (threadIdx.x == 0) {
        d = data[0];
        d1 = data1[0];
    }
    // Synchronize to ensure shared memory is initialized
    __syncthreads();

    // Perform the vector addition if within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx] + (float)(d + d1);
    }
}

__global__ void vectorAdd3_3(const float *a, const float *b, float *c, double *data, double* data1, double* data2, int n) {
    // shared memory for data
    __shared__ double d;
    __shared__ double d1;
    __shared__ double d2;
    
    // Calculate the index for this thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
     
    // Only the first thread in the block loads the data to shared memory
    if (threadIdx.x == 0) {
        d = data[0];
        d1 = data1[0];
        d2 = data2[0];
    }
    // Synchronize to ensure shared memory is initialized
    __syncthreads();

    // Perform the vector addition if within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx] + (float)(d + d1 + d2);
    }
}


int main() {
    int n = 1<<19; // Size of vectors
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    double *h_data = (double*)malloc(sizeof(double));
    // few more data for testing
    double *h_data1 = (double*)malloc(sizeof(double));
    double *h_data2 = (double*)malloc(sizeof(double));

    // Initialize input vectors on host
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }
    h_data[0] = 10.0;
    // few more data for testing
    h_data1[0] = 20.0;
    h_data2[0] = 30.0;

    // Allocate device memory for vectors
    float *d_a, *d_b, *d_c; double *d_data;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMalloc(&d_data, sizeof(double));
    // few more data for testing
    double *d_data1, *d_data2;
    cudaMalloc(&d_data1, sizeof(double));
    cudaMalloc(&d_data2, sizeof(double));

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, h_data, sizeof(double), cudaMemcpyHostToDevice);
    // few more data for testing
    cudaMemcpy(d_data1, h_data1, sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data2, h_data2, sizeof(double), cudaMemcpyHostToDevice);

    // Launch kernel with indirect pointers
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    
    vectorAdd3_1<<<blocks, threads>>>(d_a, d_b, d_c, d_data, n);
    vectorAdd3_2<<<blocks, threads>>>(d_a, d_b, d_c, d_data, d_data1, n);
    vectorAdd3_3<<<blocks, threads>>>(d_a, d_b, d_c, d_data, d_data1, d_data2, n);
    
    // Copy result back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Free memory
    free(h_a);
    free(h_b);
    free(h_c);
    free(h_data);
    free(h_data1);
    free(h_data2);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_data);
    cudaFree(d_data1);
    cudaFree(d_data2);
    return 0;
}

Since you mentioned shared memory, I wanted to test a few things about shared memory as well, using some similar code.

The vectorAdd3 kernels try to load the data into shared memory in the first thread of each block, and then every thread in the block reads from that shared memory location.
vectorAdd3_1 loads 1 double element into shared memory, while vectorAdd3_2 and vectorAdd3_3 load 2 and 3 double elements into shared memory respectively.

What I cannot understand is the math behind the memory requests to and from the shared memory:

For vectorAdd3_1:


In this case, reading a and b makes 2 memory requests per warp, a total of 2 * 16384 = 32768 memory requests of 128 bytes each.
For the read of data there is 1 memory request of one sector per block, so there are 2048 such requests in total.

That makes a total of 32768 + 2048 = 34816 memory requests to global memory.

For each of the 16384 warps there is a single memory read request from shared memory, making 16384 read requests to shared memory. I suppose there is a single request because, looking into the SASS, it makes use of LDS.U.64.

For each block we have a single store to shared memory, making a total of 2048 store requests to shared memory; the single memory request is due to STS.64.


For vectorAdd3_2

There are 32768 memory requests for the reads of a and b.
For each of data and data1 we make 1 memory request of 32 bytes per block. So that makes 2048 * 2 = 4096 memory requests.

That is a total of 32768 + 4096 = 36864 read requests to global memory.

But in the case of shared memory, it seems that there is still 1 read request per warp (making a total of 16384), because now it makes use of LDS.U.128 to load the two 64-bit values together.

Similarly, for the shared memory store, I find 1 store request per block (as in vectorAdd3_1), because it makes use of STS.128 to store the two 64-bit values in one go.


For vectorAdd3_3

For the reads of a and b there are 32768 read requests. And for data, data1 and data2 we have 3 read requests per block, i.e. 3 * 2048 = 6144. So a total of 32768 + 6144 = 38912 memory requests.

What I could not understand at first is the shared memory requests.
Why are there two read requests per warp? I expected that each 64-bit value is placed across two banks in shared memory, so the three 64-bit words could be stored in 6 consecutive banks of shared memory, and possibly loaded in just one go due to bank-level parallelism.

But that would be true if, say, different threads were reading values from different banks, each reading 32, 64 or 128 bits.

For a single thread, however, we are constrained by the amount of data a particular instruction can handle (which I guess is 128 bits). So the SASS packs the first two 64-bit values into one 128-bit value and the third into a separate 64-bit value, and hence launches two stores, STS.128 and STS.64 respectively, and two loads, LDS.U.128 and LDS.U.64.
That makes up the two load requests per warp and the two store requests per block.


Not sure if it helps:

The ld PTX instruction has an option .level::prefetch_size
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld

The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either of 64B, 128B, 256B, thereby allowing the prefetch size to be 64 Bytes, 128 Bytes or 256 Bytes respectively.

The qualifier .level::prefetch_size may only be used with .global state space and with generic addressing where the address points to .global state space. If the generic address does not fall within the address window of the global memory, then the prefetching behavior is undefined.

The .level::prefetch_size qualifier is treated as a performance hint only.

Perhaps it explains larger memory transfers from a single thread.
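For illustration, the hint looks like this in PTX (forms modeled on the examples in the PTX ISA documentation; the qualifier is only a performance hint, so the hardware may ignore it):

```
ld.global.L2::64B.b32  %r0,  [gbl];  // 4-byte load, hint: prefetch 64 B into L2
ld.global.L2::128B.f64 %fd0, [gbl];  // 8-byte load, hint: prefetch 128 B into L2
```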