Non-ideal shared memory transactions?

I posted a similar question on Stack Overflow (http://stackoverflow.com/questions/28374796/bank-conflict-cuda-shared-memory), so far without useful replies.

I have a problem with shared memory usage on a Quadro K6000. I managed to reproduce it with the minimal example attached below: a simple copy from global → shared → global memory through a shared-memory array whose rows may be padded on the right side (variable ng). Using shared memory serves no purpose here, but it illustrates the problem.

If I compile the code (-arch=sm_35, CUDA 6.5.14) with ng=0 and study the shared memory access pattern with NVVP, it tells me that there are “no issues”. The same test with (for example) ng=2 returns “Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1” at the lines marked with “NVVP warning”.

What exactly is NVVP indicating with “Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1”? And why does it occur with ng>0, but not with ng=0?

#include <cstdlib>

__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];
    const int ij  = threadIdx.x + threadIdx.y*blockDim.x;      // linear index into global memory
    const int ijs = threadIdx.x + threadIdx.y*(blockDim.x+ng); // linear index into shared memory, each row padded by ng elements

    as[ijs] = in[ij]; // NVVP warning
    __syncthreads();
    out[ij] = as[ijs]; // NVVP warning
}

int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng = 2;
    const int ncells = itot * jtot;

    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];
    for(int n=0; n<ncells; ++n)
        in[n]  = 0.001 * (std::rand() % 1000) - 0.5;

    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));
    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);

    dim3 gridGPU (1, 1, 1);
    dim3 blockGPU(16, 16, 1);

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);

    return 0;
}

If you systematically test with ng in 0 … 16, what do you find? I suspect that what NVVP reports is the consequence of bank conflicts in shared memory, but I am too lazy to work through the details of your addressing scheme to confirm or refute this hypothesis.

I quickly tested ng = 0, 1, 2, 3, 4, and 16 (skipping 5, 6, etc.); only ng=0 gives me “no issues”, the others give the same warning.

To help figure out whether it is a bank conflict, is the following reasoning about how the memory is addressed correct?

Kepler architecture: 32 banks of 8 bytes
ij = index into global memory, ijs = index into shared memory

ng=0
1st row (threadIdx.y=0) has ij = 00 .. 15, ijs = 00 .. 15 in banks 00..15
2nd row (threadIdx.y=1) has ij = 16 .. 31, ijs = 16 .. 31 in banks 16..31
3rd row (threadIdx.y=2) has ij = 32 .. 47, ijs = 32 .. 47 in banks 00..15

ng=2
1st row (threadIdx.y=0) has ij = 00 .. 15, ijs = 00 .. 15 in banks 00..15
2nd row (threadIdx.y=1) has ij = 16 .. 31, ijs = 18 .. 33 in banks 18..31 + 00..01
3rd row (threadIdx.y=2) has ij = 32 .. 47, ijs = 36 .. 51 in banks 04..19

Would that be correct? Even for a block size of 32, there shouldn’t be a bank conflict?

ng=2
1st row (threadIdx.y=0) has ij = 00 .. 31, ijs = 00 .. 31 in banks 00..31
2nd row (threadIdx.y=1) has ij = 32 .. 63, ijs = 34 .. 65 in banks 02..31 + 00..01
3rd row (threadIdx.y=2) has ij = 64 .. 95, ijs = 68 .. 99 in banks 04..31 + 00..03
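
To make this bookkeeping less error-prone, here is a tiny host-side helper (my own sketch, not part of the example above; the function name bank is made up, and it assumes Kepler’s 8-byte banks so that one double maps to exactly one bank) that reproduces the tables:

#include <cstdio>

// Bank touched by thread (tx, ty) for a row of `width` doubles padded by `ng` doubles,
// assuming 64-bit bank mode: 32 banks, one double per bank.
int bank(int tx, int ty, int width, int ng)
{
    const int ijs = tx + ty*(width+ng); // shared-memory index, as in the kernel
    return ijs % 32;
}

int main()
{
    for(int ty=0; ty<3; ++ty)           // first three rows
    {
        std::printf("row %d banks:", ty);
        for(int tx=0; tx<16; ++tx)      // width 16, ng = 2
            std::printf(" %2d", bank(tx, ty, 16, 2));
        std::printf("\n");
    }
    return 0;
}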

Your memory address analysis appears correct to me.
However, since the warp size is 32 and devices of compute capability >= 2.0 perform shared memory accesses one full warp at a time, for ng=2 you have a bank conflict between the 1st and 2nd row: with a block width of 16 both rows belong to the same warp, and both rows access banks 0 and 1, hence the conflict.
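
To illustrate this at the warp level, here is a small host-side sketch (again just my own illustration, using the same simplified 8-byte-bank model and the linear thread-to-warp mapping) that counts, per warp, how many threads hit the most heavily used bank, which is the “transactions per access” figure NVVP reports:

#include <cstdio>
#include <algorithm>

int main()
{
    const int bx = 16, by = 16;                 // block dimensions, as in the example
    for(int ng=0; ng<=2; ng+=2)                 // compare ng = 0 and ng = 2
    {
        int worst = 1;
        for(int warp=0; warp<bx*by/32; ++warp)
        {
            int hits[32] = {0};                 // threads per 8-byte bank within this warp
            for(int lane=0; lane<32; ++lane)
            {
                const int tid = warp*32 + lane;       // linear thread id
                const int tx  = tid % bx;             // back to the 2D thread index
                const int ty  = tid / bx;
                const int ijs = tx + ty*(bx+ng);      // shared-memory index from the kernel
                ++hits[ijs % 32];                     // one double per bank (64-bit mode)
            }
            worst = std::max(worst, *std::max_element(hits, hits+32));
        }
        std::printf("ng=%d: worst-case shared transactions per access = %d\n", ng, worst);
    }
    return 0;
}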

tera, Thanks for checking and clarifying things. But I still don’t think my problem is solved (or that I understand what is going on…):

  1. I thought that with a blockDim.x of 16, only 16 threads would be active within a warp. However, if I set itot=16, jtot=2 (with a thread block of the same size), only one warp is active, and in that case I could understand the bank conflict.

  2. I checked the last example again (itot=32, with a thread block width of 32 as well):

ng=2
1st row (threadIdx.y=0) has ij = 00 .. 31, ijs = 00 .. 31 in banks 00..31
2nd row (threadIdx.y=1) has ij = 32 .. 63, ijs = 34 .. 65 in banks 02..31 + 00..01
3rd row (threadIdx.y=2) has ij = 64 .. 95, ijs = 68 .. 99 in banks 04..31 + 00..03

NVVP gives me the same warning, but again I don’t understand how this could cause a bank conflict.

Disregard my last post; I forgot to set the size of the shared memory banks. With:

cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

It works for the second example. Great!
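
For completeness, the call simply goes before the kernel launch in the main() of the example above; something like this (the get-call is only there to verify the setting took effect and is not required):

    // Request 8-byte shared memory banks (a Kepler feature) before launching the kernel.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Optional: read the configuration back to confirm the setting was accepted.
    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);
    // config should now equal cudaSharedMemBankSizeEightByte

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

There is also a per-kernel variant, cudaFuncSetSharedMemConfig, if you only want to change the bank size preference for this one kernel.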

Not the case. Warps are always composed of 32 threads. Assuming your thread block has 32 or more threads in total, the first warp will always consist of 32 active threads. The assembly order of threads into a warp is covered in the programming guide.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

The only devices for which shared memory access was considered in terms of half-warps are cc 1.x devices. All other devices consider shared memory access conflicts within the scope of a full 32-thread warp.
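
Concretely, for the 16x16 block used in the example the mapping works out as follows (my summary of the linearization the guide describes, with a throwaway helper function):

// Threads are linearized as tid = threadIdx.x + threadIdx.y*blockDim.x and then
// grouped into warps of 32 consecutive tids. For blockDim = (16, 16, 1):
//
//   warp 0: threadIdx.y = 0 and 1   (tid   0 ..  31)
//   warp 1: threadIdx.y = 2 and 3   (tid  32 ..  63)
//   ...
//   warp 7: threadIdx.y = 14 and 15 (tid 224 .. 255)
//
// i.e. every warp spans two 16-wide rows of the block.
int warpOf(int tx, int ty, int blockDimX)
{
    return (tx + ty*blockDimX) / 32;
}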

64-bit bank mode is available on Kepler (cc 3.x) devices such as your K6000. 64-bit bank mode does not contravene the above statement, because, as stated in the programming guide:

“Successive 64-bit words map to successive banks.”

Therefore, 64-bit accesses by all 32 threads of a warp do not necessarily generate bank conflicts in 64-bit bank mode, because the actual bank definition (its byte granularity) changes.
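
As a concrete illustration (a sketch using a simple modulo model with made-up function names, not an exact description of the hardware): the bank a given double lands in depends on the configured bank width, which is why the same access pattern can profile differently in the two modes.

// Simplified bank mapping for element i of a shared double array (32 banks in both modes):
//
//   8-byte banks: each double occupies exactly one bank, so 32 consecutive doubles
//                 cover all 32 banks once.
//   4-byte banks: each double spans two consecutive 4-byte banks, so 32 consecutive
//                 doubles wrap around the 32 banks twice.
int bank64(int i) { return i % 32; }            // 64-bit bank mode
int bank32lo(int i) { return (2*i)     % 32; }  // 32-bit mode, lower 4 bytes of the double
int bank32hi(int i) { return (2*i + 1) % 32; }  // 32-bit mode, upper 4 bytes of the double

This lines up with the observation above that the 32-wide case only reached one transaction per access after the banks were switched to 8 bytes.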

Thanks for clarifying that. I’ll play a bit more with my minimal example to try to fully understand things.