Hi Folks,
I would like to learn, usage of shared mem in CUDA kernel. From different documents, from web, I got to know about Shared mem bank conflict.
I need to implement a simple frame difference kernel and then do a dilate operation on the diff. My approach, in the read frame pixes as 4 byte integers in order to optimize DDR/global mem accesses. Then I store the difference in Shared mem. At last I would like to examine neighborhood of a pixel, and combine them into final value.
When I try to read values from neighborhood (e.g. top left neighbor of a pixel) I run into ‘bus error’, which I am unable to get over , after looking into few examples in ~/NVIDIA_CUDA-8.0_Samples/3_Imaging.
Following is my code - may be some of you can spot the basic error easily.
__global__ void smem_diff_n_dilate( int *output, int *input1,int *input2,
const int outputWidth,
const int outputHeight,
const int inputWidthStep,
const int outputWidthStep)
{
const int cornerCol = blockDim.x*blockIdx.x;
const int cornerRow = blockDim.y*blockIdx.y;
const int globalCol = cornerCol + threadIdx.x;
const int globalRow = cornerRow + threadIdx.y;
const int readsize = 1;
const int localFrameOffset = (blockDim.x * blockDim.y) * readsize;
if (globalCol >= (outputWidth>>(readsize+1)) || (globalRow) >= (outputHeight)) return;
int globalIdx_n = ( ((globalRow * (outputWidthStep>>2)) ) + globalCol*readsize);
const int mask = 0xff;
int p1, p2, p3, p4, p5, outPixel1, outPixel2, outPixel3,threshold;
__shared__ int sharedMem[8196];
// assume that we will put no more than 32 bytes/thread, from
// input frame into shared mem
// block size == 32,32 == 1k threads, block has 32k memory
// this means that each thread gets 32 bytes in shared mem
// we have 5 frames to save in shared mem, lets put 4 bytes per frame per thread in shared mem
int thread_idx = threadIdx.y * blockDim.x + threadIdx.x;
// read pixels from first, 2nd, frame in shared mem
p1 = input1[globalIdx_n]; thread_idx += localFrameOffset;
p2 = input2[globalIdx_n]; thread_idx += localFrameOffset;
thread_idx = threadIdx.y * blockDim.x + threadIdx.x;
sharedMem[thread_idx] = ( (DIFF1(p1,p2)) | ((DIFF2(p1,p2)) << 8) | ((DIFF3(p1,p2)) << 16) | ((DIFF4(p1,p2)) << 24) ) ;
__syncthreads();
if ( (threadIdx.y-1 ) < 0) return ; if ( (threadIdx.y+1) >= blockDim.y) return;
if ( (threadIdx.x-1 ) < 0) return ; if ( (threadIdx.x+1) >= blockDim.x) return;
int idx_top_left = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1);
output[globalIdx_n] = sharedMem[thread_idx] | sharedMem[idx_top_left] ;
}
float smem_diff_n_dilate_entry(int *output,int *input1,int *input2,unsigned int widthY, unsigned int heightY , unsigned int pitch )
{
//Specify a reasonable block size
const dim3 block(32,32);
//Calculate grid size to cover the whole image
const dim3 grid(((pitch + block.x - 1)/block.x)>>2, (heightY + block.y - 1)/block.y);
//Launch the size conversion kernel
smem_diff_n_dilate<<<grid,block>>>(output,input1,input2,pitch,heightY,pitch,pitch);
return 0;
}
My kernel works when I do not examine the neighboring pixel
output[globalIdx_n] = sharedMem[thread_idx] ;
However it fails (bus error) with
output[globalIdx_n] = sharedMem[thread_idx] | sharedMem[idx_top_left] ;
Please help.