Shared memory bank conflict

I thought the following simple code (just for test purposes) would generate bank conflicts on a V100, but Nsight Compute reports no bank conflicts.

In the code, I allocated shared memory of size sizeof(double) x 32 x 32 for a matrix L, and its column values are copied from d_L in the for loop.

Since each column has 32 double values (256 bytes) and the V100 has 32 banks, each 4 bytes wide, each column's values will be stored across two rows of banks. If this is correct, I thought that L[n*j + tx] = d_L[n*j + tx] should generate a two-way bank conflict for each j = 0, ..., n-1, but Nsight Compute says there are no bank conflicts, as you can see in the captured image below.

Could anyone shed some light on this?

#include <cstdio>

__global__
void bank_conflict(int n, double *d_L, double *d_out)
{
    extern __shared__ double s[];
    double *L;
    int tx = threadIdx.x;

    L = s;
    for (int j=0; j<n; j++) {
        L[n*j + tx] = d_L[n*j + tx];
    }

    d_out[tx] = L[tx];
}

int main(int argc, char **argv)
{
    int n = 32;
    double *h_out, *h_L, *d_L, *d_out;

    cudaHostAlloc((void **)&h_L, sizeof(double)*(n*n), cudaHostAllocPortable);
    cudaHostAlloc((void **)&h_out, sizeof(double)*(n), cudaHostAllocPortable);
    cudaMalloc((void **)&d_L, sizeof(double)*(n*n));
    cudaMalloc((void **)&d_out, sizeof(double)*(n));

    for (int j=0; j<n; j++) {
        for (int i=0; i<n; i++) {
            h_L[n*j + i] = 5.0;
        }
    }
    cudaMemcpy(d_L, h_L, sizeof(double)*(n*n), cudaMemcpyHostToDevice);

    bank_conflict<<<1, n, sizeof(double)*(n*n)>>>(n, d_L, d_out);

    cudaMemcpy(h_out, d_out, sizeof(double)*n, cudaMemcpyDeviceToHost);
    for (int j=0; j<n; j++) {
        printf("h_out[%d] = %e\n", j, h_out[j]);
    }

    cudaFreeHost(h_L);
    cudaFreeHost(h_out);
    cudaFree(d_L);
    cudaFree(d_out);
    return 0;
}

There are no bank conflicts when retrieving 8-byte (or, for that matter, 16-byte) adjacent quantities in shared memory, because the memory controller breaks the request into two separate transactions, or in modern parlance, two wavefronts. So effectively the first 16 threads in the warp have one transaction, and the second 16 threads in the warp have another transaction (or wavefront), and shared memory bank conflicts are only considered with respect to other threads in the same transaction/wavefront. If you then map this out, you will discover that the first 16 threads do not double up in any bank.

@Robert_Crovella I came across this thread and had two follow-up questions.

  1. It is true that decomposing these shared memory accesses will not cause “excessive wavefronts” in Nsight Compute, because they are strictly necessary. But if either wavefront has bank conflicts, then you will get “excessive wavefronts”.

  2. If one were accessing 16 bytes from each thread, would the access be broken into 4 wavefronts of 4 threads each?

That would be my expectation. I haven’t double-checked it in Nsight Compute, and if it were somehow different from what I expect, I wouldn’t be able to explain it. I suggest that questions specific to Nsight Compute (how exactly will Nsight Compute report this condition?) be asked in the Nsight Compute forum. But I stand by my statements regarding how the GPU breaks those requests into transactions or “wavefronts”, and the result that it does not count as a bank conflict.

It would be broken into 4 wavefronts of 8 threads each: 4 transactions x 8 threads/transaction = 32 threads in a warp, and 8 threads x 16 bytes/thread = 128 bytes maximum per wavefront.

At least at one point in time, assessing shared memory bank conflicts using Nsight Compute was not always trivial. Things may have changed since then, but I generally refer folks to the Nsight Compute experts rather than delve into Nsight Compute behavior myself, as far as the reporting of shared bank conflicts goes and how exactly to interpret the readout.


Hello, I saw the following rule in another answer. Does it also apply to devices of CC 8.0-8.9? And where can I find the document describing this rule?
—————————————————————————————————————————————————
On CC 7.0 - 7.5 devices shared memory loads with uniform addresses can increase bandwidth if the following is true:

  • Thread pairs (Tn and Tn^1) have the same addresses for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
  • Thread pairs (Tn and Tn^2) have the same addresses for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)
  • (Note these encompass the case where all active threads have the same address)