Shared Mem usage in Tx2 CUDA kernel

Hi Folks,

I would like to learn the usage of shared memory in CUDA kernels. From various documents on the web, I learned about shared memory bank conflicts.

I need to implement a simple frame-difference kernel and then do a dilate operation on the diff. My approach: read frame pixels as 4-byte integers in order to optimize DDR/global memory accesses, then store the difference in shared memory. Finally, I examine the neighborhood of each pixel and combine the values into the final result.
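As a scalar reference for what this should compute, here is a simplified host-side sketch in plain C (illustrative names; the actual kernel packs 4 pixels per int and combines packed words with a bitwise OR, whereas this sketch uses the textbook max-over-neighborhood dilate):

```c
#include <stdlib.h>

/* Scalar reference: absolute frame difference followed by a 3x3 dilate.
 * Illustrative only; the CUDA kernel processes 4 packed pixels per thread. */
static void diff_then_dilate(const unsigned char *f1, const unsigned char *f2,
                             unsigned char *out, int w, int h)
{
    unsigned char *diff = malloc((size_t)w * h);

    /* Step 1: per-pixel absolute difference of the two frames. */
    for (int i = 0; i < w * h; i++)
        diff[i] = (unsigned char)abs((int)f1[i] - (int)f2[i]);

    /* Step 2: dilate = max over the 3x3 neighborhood (clamped at borders). */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned char m = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int ny = y + dy, nx = x + dx;
                    if (ny >= 0 && ny < h && nx >= 0 && nx < w &&
                        diff[ny * w + nx] > m)
                        m = diff[ny * w + nx];
                }
            }
            out[y * w + x] = m;
        }
    }
    free(diff);
}
```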

When I try to read values from the neighborhood (e.g. the top-left neighbor of a pixel) I run into a ‘bus error’, which I am unable to get past even after looking into a few examples in ~/NVIDIA_CUDA-8.0_Samples/3_Imaging.

Following is my code - maybe some of you can spot the basic error easily.

__global__ void smem_diff_n_dilate(  int *output, int *input1,int *input2,
		const int outputWidth,
		const int outputHeight,
		const int inputWidthStep,
		const int outputWidthStep)
{
   const int cornerCol = blockDim.x*blockIdx.x;  
   const int cornerRow = blockDim.y*blockIdx.y;  
   const int globalCol = cornerCol + threadIdx.x;  
   const int globalRow = cornerRow + threadIdx.y;
   const int readsize       = 1;
   const int localFrameOffset = (blockDim.x * blockDim.y) * readsize;

   if (globalCol >= (outputWidth>>(readsize+1)) || (globalRow) >= (outputHeight)) return;

   int globalIdx_n      = ( ((globalRow * (outputWidthStep>>2)) ) + globalCol*readsize);  
   const int mask           = 0xff;
   int p1, p2, p3, p4, p5, outPixel1, outPixel2, outPixel3,threshold;

   __shared__ int sharedMem[8196];   
   // assume that we will put no more than 32 bytes/thread, from 
   // input frame into shared mem
   // block size == 32,32 == 1k threads, block has 32k memory
   // this means that each thread gets 32 bytes in shared mem
   // we have 5 frames to save in shared mem, lets put 4 bytes per frame per thread in shared mem

   int thread_idx = threadIdx.y * blockDim.x + threadIdx.x;
   // read pixels from first, 2nd, frame in shared mem
   p1 = input1[globalIdx_n]; thread_idx += localFrameOffset;  
   p2 = input2[globalIdx_n]; thread_idx += localFrameOffset;
   thread_idx = threadIdx.y * blockDim.x + threadIdx.x;

   sharedMem[thread_idx]      = ( (DIFF1(p1,p2)) | ((DIFF2(p1,p2)) << 8) | ((DIFF3(p1,p2)) << 16) | ((DIFF4(p1,p2))  << 24) ) ; 


   if ( (threadIdx.y-1 ) < 0) return ; if ( (threadIdx.y+1)  >= blockDim.y) return;
   if ( (threadIdx.x-1 ) < 0) return ; if ( (threadIdx.x+1)  >= blockDim.x) return;

   int idx_top_left          = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1);
   output[globalIdx_n]       = sharedMem[thread_idx] | sharedMem[idx_top_left];
}

float smem_diff_n_dilate_entry(int *output, int *input1, int *input2, unsigned int widthY, unsigned int heightY, unsigned int pitch)
{
    //Specify a reasonable block size
    const dim3 block(32,32);
    //Calculate grid size to cover the whole image
    const dim3 grid(((pitch + block.x - 1)/block.x)>>2, (heightY + block.y - 1)/block.y);
    //Launch the size conversion kernel

    return 0;
}

My kernel works when I do not examine the neighboring pixel:

output[globalIdx_n]       = sharedMem[thread_idx] ;

However, it fails (bus error) with:

output[globalIdx_n]       = sharedMem[thread_idx] | sharedMem[idx_top_left];

Please help.


Happy New Year!

Usually, a bus error is caused by an invalid concurrent read/write.
Please check whether your implementation is missing any synchronization handling.
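The pattern in question: every thread writes its own sharedMem slot and then reads a neighbor's slot, so a barrier (__syncthreads() in CUDA) is needed between the store and the load. A host-side analogy of that write/barrier/read pattern using POSIX threads (illustrative names, not CUDA code):

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>

#define NTHREADS 4

static pthread_barrier_t barrier;
static int shared_mem[NTHREADS];  /* stands in for __shared__ sharedMem[] */
static int results[NTHREADS];

/* Each thread writes its own slot, waits at the barrier, then reads a
 * neighbor's slot -- the same store/sync/load pattern that needs a
 * __syncthreads() between the shared-memory write and the neighbor read. */
static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    shared_mem[tid] = tid * 10;          /* sharedMem[thread_idx] = ...  */
    pthread_barrier_wait(&barrier);      /* analogous to __syncthreads() */
    results[tid] = shared_mem[(tid + 1) % NTHREADS];  /* read a neighbor */
    return NULL;
}

/* Returns 1 if every thread observed its neighbor's finished write. */
static int run_demo(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    for (int i = 0; i < NTHREADS; i++)
        if (results[i] != ((i + 1) % NTHREADS) * 10)
            return 0;
    return 1;
}
```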


Hi AastaLLL
Happy New Year to you and the Jetson fraternity !

I am unable to spot the synchronization issue. Is there an example that can be adapted to do convolution/erosion/dilation on live camera images (using Argus/EGL)?



For image processing examples, please check our CUDA samples or the VisionWorks library.

It’s recommended to double-check the calculation of the thread index.
A common issue is that the program produces the same index value under some conditions.
Concurrent access to output[globalIdx_n] or sharedMem[thread_idx] can also lead to a bus error.


Hi AastaLLL,

Thanks for your help, and I appreciate your intuition. Indeed, it was an issue of indices going negative at the boundary of blocks.


Changing

int idx_top_left          = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1);

to

int idx_top_left          = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1); if (idx_top_left < 0) idx_top_left = 0;

fixes the issue.
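A side note on why the earlier boundary guard if ((threadIdx.y-1) < 0) return; never fired: threadIdx components are unsigned, so threadIdx.y - 1 wraps to UINT_MAX at the block edge instead of going negative; only once the arithmetic lands in a signed int does the index come out negative. A small host-side illustration (blockDim.x assumed to be 32, as in my launch configuration):

```c
#include <limits.h>

/* threadIdx.x / threadIdx.y are unsigned in CUDA: subtracting 1 from 0
 * wraps around to UINT_MAX instead of producing -1, which is why
 * "if ((threadIdx.y - 1) < 0) return;" never returns at the block edge. */
static unsigned int wrapped_minus_one(unsigned int thread_idx_y)
{
    return thread_idx_y - 1;  /* 0u - 1 == UINT_MAX */
}

/* Once the arithmetic is done in signed int, the top-left index the kernel
 * computes for thread (0,0) with blockDim.x == 32 comes out negative. */
static int top_left_index_at_origin(void)
{
    int y_minus_1 = 0 - 1;              /* threadIdx.y - 1 as a signed value */
    int x_minus_1 = 0 - 1;              /* threadIdx.x - 1 as a signed value */
    return y_minus_1 * 32 + x_minus_1;  /* -33: off the front of sharedMem[] */
}
```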

Next, I am looking to optimize global (DDR) memory fetches from the frame buffers. I want to make sure that frame reads are optimally coalesced. Do you think reading 8-bit frame pixels as ints (groups of 4 pixels) is the most optimal way?

p1 = input1[globalIdx_n]; // input1 == int pointer to frame buffer 1
p2 = input2[globalIdx_n]; // input2 == int pointer to frame buffer 2

The two lines above would issue word fetches to DDR at two different (discontiguous) addresses. Would it help if I fetched and processed 8 or 16 pixels from each frame at the same time?
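For reference, here is a host-side sketch of the packed per-byte absolute difference; the DIFF1..DIFF4 macros are not shown in the post, so this is an assumption about what they compute:

```c
#include <stdint.h>
#include <stdlib.h>

/* Absolute difference of byte lane n (0..3) of two packed 4-pixel words --
 * roughly what the DIFF1..DIFF4 macros are assumed to compute. */
static uint32_t byte_diff(uint32_t a, uint32_t b, int n)
{
    int pa = (a >> (8 * n)) & 0xff;
    int pb = (b >> (8 * n)) & 0xff;
    return (uint32_t)abs(pa - pb);
}

/* Repack the four per-byte differences into one 32-bit word, mirroring
 * "(DIFF1 | DIFF2 << 8 | DIFF3 << 16 | DIFF4 << 24)" in the kernel. */
static uint32_t packed_abs_diff(uint32_t a, uint32_t b)
{
    return  byte_diff(a, b, 0)
         | (byte_diff(a, b, 1) << 8)
         | (byte_diff(a, b, 2) << 16)
         | (byte_diff(a, b, 3) << 24);
}
```

On the device side, CUDA also provides SIMD intrinsics such as __vabsdiffu4(), which computes the per-byte absolute difference of two packed 32-bit words in a single operation, so the per-lane shifting above may not be needed at all.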



Sorry for the late reply.

Coalescing is related to many factors, e.g. buffer size, kernel size, and hardware resources.
It’s recommended to monitor GPU behavior via nvprof to get more information.
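For example, nvprof's efficiency metrics give a quick read on coalescing and shared-memory bank conflicts (metric names from the nvprof documentation; availability varies by GPU, and ./smem_diff_app is a placeholder for your binary):

```shell
# Low gld_efficiency / gst_efficiency usually means poorly coalesced
# global accesses; shared_efficiency hints at bank conflicts.
nvprof --metrics gld_efficiency,gst_efficiency,shared_efficiency ./smem_diff_app

# Plain run for per-kernel timing summary
nvprof ./smem_diff_app
```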