I have an implementation which uses shared memory and NVVP shows information that I interpret as contradictory, but you guys will be able to correct me. The kernel is:
__global__ void Vel_Convert(const float * __restrict__ vrms, const float * __restrict__ time, float *vint, const uint32_t LENGTH)
{
extern __shared__ float SM_Vrms2Vint[]; // Shared memory space
float *SM_Vrms = &SM_Vrms2Vint[0], // Velocity portion of the memory
*SM_Time = &SM_Vrms2Vint[blockDim.x]; // Time portion of the memory
const uint32_t GLOBAL_IDX = blockDim.x * blockIdx.x + threadIdx.x,
OFFSET = gridDim.x * blockDim.x;
for(uint32_t idx = GLOBAL_IDX; idx < LENGTH; idx += OFFSET) // Fills the shared memory
{
SM_Vrms[threadIdx.x] = vrms[idx];
SM_Time[threadIdx.x] = time[idx];
}
__syncthreads();
if(threadIdx.x > 0) // Branch divergence here, thread 0 does different job
vint[GLOBAL_IDX] = sqrt(((SM_Time[threadIdx.x] * (SM_Vrms[threadIdx.x] * SM_Vrms[threadIdx.x])) -
(SM_Time[threadIdx.x - 1] * (SM_Vrms[threadIdx.x - 1] * SM_Vrms[threadIdx.x - 1]))) /
(SM_Time[threadIdx.x] - SM_Time[threadIdx.x - 1]));
else
vint[GLOBAL_IDX] = SM_Vrms[threadIdx.x];
}
The kernel is launched this way:
Vel_Convert <<< gSize, bSize, bSize * 2 * sizeof(float) >>> (rp_dVrms, rp_dTime, rp_dVint, NUM_LINES);
It takes 2 arrays as inputs and 1 as output as well as the length and I allocate shared memory for twice the value of block size and split it for 2 different input data that is used along the kernel.
NVVP reports “No Issues” in the Shared Memory Access Pattern (unguided analysis), while in the “Properties” tab to the right, which displays general information about the kernel, shows the yellow ! in “Efficiency/Shared Efficiency” reported as 39,3%.
When I see “No issues”, then I think there is no bank conflict, while the low efficiency can be associated to conflicts in various threads around the net. I suspect it is because of the way I read the current shared memory position (at threadIdx.x) and the position before it (at threadIdx.x - 1), which is needed for the calculation.
So I seek your advice to know if the low shared efficiency is due to bank conflicts (despite the “no issues” message) and how it can be improved.