Non-ideal shared memory transactions?

I posted a similar question on Stack Overflow (http://stackoverflow.com/questions/28374796/bank-conflict-cuda-shared-memory), so far without useful replies.

I have a problem with shared memory usage on a Quadro K6000. I managed to reproduce it with the minimal example attached below: a simple copy from global → shared → global memory through a shared-memory array whose rows may be padded on the right side (variable ng). Using shared memory serves no purpose here, but it illustrates the problem.

If I compile the code (-arch=sm_35, CUDA 6.5.14) with ng=0 and study the shared memory access pattern with NVVP, it tells me that there are “no issues”. The same test with (for example) ng=2 returns “Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1” at the lines marked with “NVVP warning”.

What exactly is NVVP indicating with “Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1”? And why does it occur with ng>0, but not with ng=0?

#include <cstdlib>

__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];
    const int ij  = threadIdx.x + threadIdx.y*blockDim.x;      // linear index into global memory
    const int ijs = threadIdx.x + threadIdx.y*(blockDim.x+ng); // linear index into shared memory, each row padded by ng elements

    as[ijs] = in[ij]; // NVVP warning
    __syncthreads();
    out[ij] = as[ijs]; // NVVP warning
}

int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng = 2;
    const int ncells = itot * jtot;

    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];
    for(int n=0; n<ncells; ++n)
        in[n]  = 0.001 * (std::rand() % 1000) - 0.5;

    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));
    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);

    dim3 gridGPU (1, 1, 1);
    dim3 blockGPU(16, 16, 1);

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);

    return 0;
}

If you systematically test with ng in 0 … 16, what do you find? I suspect that what NVVP reports is the consequence of bank conflicts in shared memory, but I am too lazy to work through the details of your addressing scheme to confirm or refute this hypothesis.

I quickly tested ng = 0, 1, 2, 3, 4, and 16 (skipping 5, 6, etc.); only ng=0 gives me “no issues”, the others give the same warning.

To help figure out whether it is a bank conflict, is the following reasoning about how the memory is addressed correct?

Kepler architecture: 32 banks of 8 bytes
ij = index into global memory, ijs = index into shared memory

ng=0
1st row (threadIdx.y=0) has ij = 00 .. 15, ijs = 00 .. 15 in banks 00..15
2nd row (threadIdx.y=1) has ij = 16 .. 31, ijs = 16 .. 31 in banks 16..31
3rd row (threadIdx.y=2) has ij = 32 .. 47, ijs = 32 .. 47 in banks 00..15

ng=2
1st row (threadIdx.y=0) has ij = 00 .. 15, ijs = 00 .. 15 in banks 00..15
2nd row (threadIdx.y=1) has ij = 16 .. 31, ijs = 18 .. 33 in banks 18..31 + 00..01
3rd row (threadIdx.y=2) has ij = 32 .. 47, ijs = 36 .. 51 in banks 04..19

Would that be correct? Even for a block size of 32, there shouldn’t be a bank conflict?

ng=2
1st row (threadIdx.y=0) has ij = 00 .. 31, ijs = 00 .. 31 in banks 00..31
2nd row (threadIdx.y=1) has ij = 32 .. 63, ijs = 34 .. 65 in banks 02..31 + 00..01
3rd row (threadIdx.y=2) has ij = 64 .. 95, ijs = 68 .. 99 in banks 04..31 + 00..03
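
To make this bookkeeping less error-prone, here is a tiny host-side helper (my own sketch, not part of the example above; the function name bank is made up, and it assumes Kepler’s 8-byte banks so that one double maps to exactly one bank) that reproduces the tables:

#include <cstdio>

// Bank touched by thread (tx, ty) for a row of `width` doubles padded by `ng` doubles,
// assuming 64-bit bank mode: 32 banks, one double per bank.
int bank(int tx, int ty, int width, int ng)
{
    const int ijs = tx + ty*(width+ng); // shared-memory index, as in the kernel
    return ijs % 32;
}

int main()
{
    for(int ty=0; ty<3; ++ty)           // first three rows
    {
        std::printf("row %d banks:", ty);
        for(int tx=0; tx<16; ++tx)      // width 16, ng = 2
            std::printf(" %2d", bank(tx, ty, 16, 2));
        std::printf("\n");
    }
    return 0;
}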

Your memory address analysis appears correct to me.
However, since the warp size is 32 and devices of compute capability >= 2.0 perform shared memory accesses one full warp at a time, for ng=2 you have a bank conflict between the 1st and 2nd row: with a block width of 16 both rows belong to the same warp, and both rows access banks 0 and 1, hence the conflict.
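
To illustrate this at the warp level, here is a small host-side sketch (again just my own illustration, using the same simplified 8-byte-bank model and the linear thread-to-warp mapping) that counts, per warp, how many threads hit the most heavily used bank, which is the “transactions per access” figure NVVP reports:

#include <cstdio>
#include <algorithm>

int main()
{
    const int bx = 16, by = 16;                 // block dimensions, as in the example
    for(int ng=0; ng<=2; ng+=2)                 // compare ng = 0 and ng = 2
    {
        int worst = 1;
        for(int warp=0; warp<bx*by/32; ++warp)
        {
            int hits[32] = {0};                 // threads per 8-byte bank within this warp
            for(int lane=0; lane<32; ++lane)
            {
                const int tid = warp*32 + lane;       // linear thread id
                const int tx  = tid % bx;             // back to the 2D thread index
                const int ty  = tid / bx;
                const int ijs = tx + ty*(bx+ng);      // shared-memory index from the kernel
                ++hits[ijs % 32];                     // one double per bank (64-bit mode)
            }
            worst = std::max(worst, *std::max_element(hits, hits+32));
        }
        std::printf("ng=%d: worst-case shared transactions per access = %d\n", ng, worst);
    }
    return 0;
}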

tera, Thanks for checking and clarifying things. But I still don’t think my problem is solved (or that I understand what is going on…):

  1. I thought that with a blockDim.x of 16, only 16 threads would be active within a warp. However, if I set itot=16, jtot=2 (with a thread block of the same size), only one warp is active, and in that case I could understand the bank conflict.

  2. I checked the last example again (itot=32, with a thread block width of 32 as well):

ng=2
1st row (threadIdx.y=0) has ij = 00 .. 31, ijs = 00 .. 31 in banks 00..31
2nd row (threadIdx.y=1) has ij = 32 .. 63, ijs = 34 .. 65 in banks 02..31 + 00..01
3rd row (threadIdx.y=2) has ij = 64 .. 95, ijs = 68 .. 99 in banks 04..31 + 00..03

NVVP gives me the same warning, but again I don’t understand how this could cause a bank conflict.

Disregard my last post; I forgot to set the size of the shared memory banks. With:

cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

It works for the second example. Great!
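
For completeness, the call simply goes before the kernel launch in the main() of the example above; something like this (the get-call is only there to verify the setting took effect and is not required):

    // Request 8-byte shared memory banks (a Kepler feature) before launching the kernel.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Optional: read the configuration back to confirm the setting was accepted.
    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);
    // config should now equal cudaSharedMemBankSizeEightByte

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

There is also a per-kernel variant, cudaFuncSetSharedMemConfig, if you only want to change the bank size preference for this one kernel.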

Not the case. Warps are always composed of 32 threads. Assuming your thread block has 32 or more threads in total, the first warp will always consist of 32 active threads. The assembly order of threads into a warp is covered in the programming guide.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

The only devices for which shared memory access was considered in terms of half-warps are cc 1.x devices. All other devices consider shared memory access conflicts within the scope of a full 32-thread warp.
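
Concretely, for the 16x16 block used in the example the mapping works out as follows (my summary of the linearization the guide describes, with a throwaway helper function):

// Threads are linearized as tid = threadIdx.x + threadIdx.y*blockDim.x and then
// grouped into warps of 32 consecutive tids. For blockDim = (16, 16, 1):
//
//   warp 0: threadIdx.y = 0 and 1   (tid   0 ..  31)
//   warp 1: threadIdx.y = 2 and 3   (tid  32 ..  63)
//   ...
//   warp 7: threadIdx.y = 14 and 15 (tid 224 .. 255)
//
// i.e. every warp spans two 16-wide rows of the block.
int warpOf(int tx, int ty, int blockDimX)
{
    return (tx + ty*blockDimX) / 32;
}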

64-bit bank mode is available on Kepler (cc 3.x) devices such as your K6000. 64-bit bank mode does not contravene the above statement, because, as stated in the programming guide:

“Successive 64-bit words map to successive banks.”

Therefore, 64-bit accesses by all 32 threads of a warp do not necessarily generate bank conflicts in 64-bit bank mode, because the actual bank definition (its byte granularity) changes.
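
As a concrete illustration (a sketch using a simple modulo model with made-up function names, not an exact description of the hardware): the bank a given double lands in depends on the configured bank width, which is why the same access pattern can profile differently in the two modes.

// Simplified bank mapping for element i of a shared double array (32 banks in both modes):
//
//   8-byte banks: each double occupies exactly one bank, so 32 consecutive doubles
//                 cover all 32 banks once.
//   4-byte banks: each double spans two consecutive 4-byte banks, so 32 consecutive
//                 doubles wrap around the 32 banks twice.
int bank64(int i) { return i % 32; }            // 64-bit bank mode
int bank32lo(int i) { return (2*i)     % 32; }  // 32-bit mode, lower 4 bytes of the double
int bank32hi(int i) { return (2*i + 1) % 32; }  // 32-bit mode, upper 4 bytes of the double

This lines up with the observation above that the 32-wide case only reached one transaction per access after the banks were switched to 8 bytes.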

Thanks for clarifying that. I’ll play a bit more with my minimal example to try to fully understand things.