Shared memory bank conflict

I thought the following simple code (just for test purposes) would generate bank conflicts on a V100, but Nsight Compute reports no bank conflicts.

In the code, I allocated shared memory of size sizeof(double) x 32 x 32 for a matrix L, and its column values are copied from d_L in the for loop.

Since each column has 32 double values (256 bytes) and the V100 has 32 banks, each 4 bytes wide, each column's values will be stored across two rows of banks. If this is correct, I thought that L[n*j + tx] = d_L[n*j + tx] should generate a two-way bank conflict for each j = 0, ..., n-1, but Nsight Compute says there are no bank conflicts, as you can see in the captured image below.

Could anyone shed some light on this?

#include <cstdio>

__global__
void bank_conflict(int n, double *d_L, double *d_out)
{
    extern __shared__ double s[];
    double *L;
    int tx = threadIdx.x;

    L = s;
    for (int j=0; j<n; j++) {
        L[n*j + tx] = d_L[n*j + tx];
    }

    d_out[tx] = L[tx];
}

int main(int argc, char **argv)
{
    int n = 32;
    double *h_out, *h_L, *d_L, *d_out;

    cudaHostAlloc((void **)&h_L, sizeof(double)*(n*n), cudaHostAllocPortable);
    cudaHostAlloc((void **)&h_out, sizeof(double)*(n), cudaHostAllocPortable);
    cudaMalloc((void **)&d_L, sizeof(double)*(n*n));
    cudaMalloc((void **)&d_out, sizeof(double)*(n));

    for (int j=0; j<n; j++) {
        for (int i=0; i<n; i++) {
            h_L[n*j + i] = 5.0;
        }
    }
    cudaMemcpy(d_L, h_L, sizeof(double)*(n*n), cudaMemcpyHostToDevice);

    bank_conflict<<<1, n, sizeof(double)*(n*n)>>>(n, d_L, d_out);

    cudaMemcpy(h_out, d_out, sizeof(double)*n, cudaMemcpyDeviceToHost);
    for (int j=0; j<n; j++) {
        printf("h_out[%d] = %e\n", j, h_out[j]);
    }

    cudaFreeHost(h_L);
    cudaFreeHost(h_out);
    cudaFree(d_L);
    cudaFree(d_out);
    return 0;
}

There are no bank conflicts when retrieving 8-byte (or, for that matter, 16-byte) adjacent quantities in shared memory, because the memory controller breaks the request into two separate transactions, or in modern parlance, two wavefronts. So effectively the first 16 threads in the warp have one transaction, and the second 16 threads in the warp have another transaction (or wavefront), and shared memory bank conflicts are only considered with respect to other threads in the same transaction/wavefront. If you then map this out, you will discover that the first 16 threads do not double up in any bank.

@Robert_Crovella I came across this thread and had two follow-up questions.

  1. It is true that decomposing these shared memory accesses will not cause “excessive wavefronts” in Nsight Compute, because they are strictly necessary. But if either wavefront has bank conflicts, then you will get “excessive wavefronts”.

  2. If one were accessing 16 bytes from each thread, would the access be broken into 4 wavefronts of 4 threads each?

That would be my expectation. I haven’t double-checked it in Nsight Compute, and if it were somehow different from what I expect, I wouldn’t be able to explain it. I suggest that questions specific to Nsight Compute (how exactly will Nsight Compute report this condition?) be asked in the Nsight Compute forum. But I stand by my statements regarding how the GPU breaks those requests into transactions or “wavefronts”, and the result that it does not count as a bank conflict.

It would be broken into 4 wavefronts of 8 threads each: 4 transactions x 8 threads/transaction = 32 threads in a warp, and 8 threads x 16 bytes/thread = 128 bytes maximum per wavefront.

At least at one point in time, assessing shared memory bank conflicts using Nsight Compute was not always trivial. Things may have changed since then, but I generally refer folks to the Nsight Compute experts rather than delve into Nsight Compute behavior myself, as far as the reporting of shared bank conflicts goes and how exactly to interpret the readout.


Hello, I saw the following rule in another answer. Does it also apply to devices of CC 8.0-8.9? And where can I find the document describing this rule?
—————————————————————————————————————————————————
On CC 7.0 - 7.5 devices shared memory loads with uniform addresses can increase bandwidth if the following is true:

  • Thread pairs (Tn and Tn^1) have the same addresses for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
  • Thread pairs (Tn and Tn^2) have the same addresses for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)
  • (Note these encompass the case where all active threads have the same address)