A basic question about a shared memory conflict in a simple example

Hello all,

I did a basic test: 1 thread block with 128 threads, and a shared memory buffer of 128 integer elements. The kernel is as follows:

#include <stdio.h>
#include "cuda_runtime_api.h"

__global__ void smem_offset_test(int * out)
{
    __shared__ int var_arr[128];
    int sum = 0;

    if(threadIdx.x == 0)
    {
        for(int i = 0; i < 128; i++)
        {
            var_arr[i] = i;
        }
    }

    __syncthreads();

    for(int i = 0; i < 128; i++)
    {
        sum += var_arr[i];
    }

    out[threadIdx.x] = sum;

    return;
}

NVVP shows that the shared memory load in the loop "sum += var_arr[i];" is 2-way conflicted. I don’t understand the reason.

My understanding is that the 128 threads make up 4 warps, and within each warp all threads read the same element in each iteration, so each load should be a broadcast. I checked the PTX file, but I didn’t find any clue that I can understand.

The whole source code is as follows. Any feedback is appreciated.

Susan

///////////////////////////////////////////////////
///test.cu

#include <stdio.h>
#include "cuda_runtime_api.h"

__global__ void smem_offset_test(int * out)
{
    __shared__ int var_arr[128];
    int sum = 0;

    if(threadIdx.x == 0)
    {
        for(int i = 0; i < 128; i++)
        {
            var_arr[i] = i;
        }
    }

    __syncthreads();

    for(int i = 0; i < 128; i++)
    {
        sum += var_arr[i];
    }

    out[threadIdx.x] = sum;

    return;
}

int main()
{
    int * d_data, * h_data;
    cudaMalloc((void**)&d_data, sizeof(int) * 128);
    h_data = (int *)malloc(sizeof(int) * 128);

    smem_offset_test<<<1, 128>>>(d_data);

    cudaThreadSynchronize();

    cudaMemcpy(h_data, d_data, sizeof(int) * 128, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);

    return 0;
}
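For reference, every thread sums 0 + 1 + ... + 127 = 8128, so a minimal host-side sanity check (placed before freeing h_data) might look like:

    for (int i = 0; i < 128; i++)
        if (h_data[i] != 8128)   // sketch: each output element should be 8128
            printf("mismatch at %d: %d\n", i, h_data[i]);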

Sorry, I forgot to post the platform info:

GPU: Tesla K20c
CUDA 5.0 V0.2.1221 Built on Fri_Sep_21_17:28:58_PDT_2012
OS: GNU/Linux

Please help me understand why this 2-way conflict exists.

Susan

It’s a good question! I assume you’re concerned about the ~50% “Shared Memory Efficiency” column?

You would think the second loop would consist entirely of broadcast loads and achieve 100% efficiency.

But inspecting the SASS shows us ptxas is emitting “LDS.128” ops – probably because the values are adjacent in shared memory.

Furthermore, there are 2 “Shared Memory Load Transactions Per Request” and 64 “Shared Load Transactions”, which, to me, implies that the K20c is loading 64 bits at a time and the 128-bit request is broken into two 64-bit transactions.

So, perhaps this really isn’t a problem and the “Shared Memory Efficiency” statistic is misleading in this case, because the large request is efficiently broken into smaller transactions supported by the K20c; i.e., the K20c is still broadcast-loading efficiently.

Furthermore, if you coax nvcc/ptxas into using 32-bit “LDS” loads, you’ll see the shared memory efficiency jump and the “Shared Load Transactions Per Request” drop to 1. Note that this is probably worse-performing code (the loop is no longer unrolled):

for (int j = 0; j < 16; j++)
    for (int i = j; i < 128; i += 16)
        sum += var_arr[i];
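
Another sketch, assuming ptxas handles volatile shared accesses the usual way (it should not combine them), is to declare the array volatile so each read stays a separate 32-bit LDS:

volatile __shared__ int var_arr[128];   // volatile should keep ptxas from fusing reads into LDS.128
int sum = 0;
for (int i = 0; i < 128; i++)
    sum += var_arr[i];                  // each iteration remains a single 32-bit load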

Be aware that your single-thread shared store initialization loop is also using a 128-bit op for shared stores (“STS.128”).
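If you want the initialization to use 32-bit stores as well, one sketch is to let every thread write its own element instead of looping in thread 0:

var_arr[threadIdx.x] = threadIdx.x;   // one 32-bit STS per thread instead of a single-thread STS.128 loop
__syncthreads();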

I’m sure someone from NVIDIA can provide a final proper interpretation of these NVVP metrics.

The shared memory transaction size is 256 bytes for Kepler. Any shared memory request that results in accessing more than 256 bytes will be split into multiple transactions of 256 bytes. The events shared_load_replay and shared_store_replay count the extra transactions for a request, which include shared bank conflicts as well.

In nvvp, the shared_efficiency metric is evaluated as (100 * shared_memory_requests) / shared_memory_transactions, which gives inaccurate results for 32-bit and 128-bit shared memory accesses.

For 128-bit accesses there will be 2 transactions per shared memory request (if all threads in a warp access 128 bits, the warp accesses 512 bytes). Hence shared_efficiency will be reported as 50%, but that is not correct: the shared memory request has 100% efficiency.

For 32-bit accesses there will be 1 transaction per shared memory request, and hence shared_efficiency will be reported as 100%, but that is not correct either. The actual efficiency in this case is 50%, as the request results in a 128-byte access, which is half of the maximum transaction size.
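
As a worked example of the arithmetic above, for one warp of 32 threads:

128-bit loads: 32 threads * 16 B = 512 B -> 2 transactions of 256 B
               reported efficiency = 100 * 1 / 2 = 50%, actual = 512 B / (2 * 256 B) = 100%
 32-bit loads: 32 threads * 4 B  = 128 B -> 1 transaction of 256 B
               reported efficiency = 100 * 1 / 1 = 100%, actual = 128 B / 256 B = 50%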

This is a known issue and it will be fixed in a future CUDA release.

Thank you both for such helpful info. It is much clearer to me now.

Susan