Shared mem usage in TX2 CUDA kernel

Hi Folks,

I would like to learn how to use shared memory in a CUDA kernel. From various documents on the web I have learned about shared memory bank conflicts.

I need to implement a simple frame-difference kernel and then do a dilate operation on the diff. My approach: read the frame pixels as 4-byte integers in order to optimize DDR/global memory accesses, then store the difference in shared memory. Finally, I examine the neighborhood of each pixel and combine the neighbors into the final value.

When I try to read values from the neighborhood (e.g. the top-left neighbor of a pixel) I run into a ‘bus error’, which I have been unable to get past even after looking at a few examples in ~/NVIDIA_CUDA-8.0_Samples/3_Imaging.

Following is my code; maybe some of you can spot the basic error easily.

__global__ void smem_diff_n_dilate(  int *output, int *input1,int *input2,
		const int outputWidth,
		const int outputHeight,
		const int inputWidthStep,
		const int outputWidthStep)
{
   const int cornerCol = blockDim.x*blockIdx.x;  
   const int cornerRow = blockDim.y*blockIdx.y;  
   const int globalCol = cornerCol + threadIdx.x;  
   const int globalRow = cornerRow + threadIdx.y;
   const int readsize       = 1;
   const int localFrameOffset = (blockDim.x * blockDim.y) * readsize;

   if (globalCol >= (outputWidth>>(readsize+1)) || (globalRow) >= (outputHeight)) return;

   int globalIdx_n      = ( ((globalRow * (outputWidthStep>>2)) ) + globalCol*readsize);  
   const int mask           = 0xff;
   int p1, p2, p3, p4, p5, outPixel1, outPixel2, outPixel3,threshold;


   __shared__ int sharedMem[8196];   
   // assume that we will put no more than 32 bytes/thread, from 
   // input frame into shared mem
   // block size == 32,32 == 1k threads, block has 32k memory
   // this means that each thread gets 32 bytes in shared mem
   // we have 5 frames to save in shared mem, lets put 4 bytes per frame per thread in shared mem

   int thread_idx = threadIdx.y * blockDim.x + threadIdx.x;
   // read pixels from first, 2nd, frame in shared mem
   p1 = input1[globalIdx_n]; thread_idx += localFrameOffset;  
   p2 = input2[globalIdx_n]; thread_idx += localFrameOffset;
   thread_idx = threadIdx.y * blockDim.x + threadIdx.x;


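   // DIFF1..DIFF4 (defined elsewhere) take the per-byte difference of the two
   // packed pixels; the four 8-bit results are packed back into a single int: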
   sharedMem[thread_idx]      = ( (DIFF1(p1,p2)) | ((DIFF2(p1,p2)) << 8) | ((DIFF3(p1,p2)) << 16) | ((DIFF4(p1,p2))  << 24) ) ; 

   __syncthreads();  

   if ( (threadIdx.y-1) < 0 )            return;
   if ( (threadIdx.y+1) >= blockDim.y )  return;
   if ( (threadIdx.x-1) < 0 )            return;
   if ( (threadIdx.x+1) >= blockDim.x )  return;

   int idx_top_left          = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1);
    
   output[globalIdx_n]       = sharedMem[thread_idx] | sharedMem[idx_top_left];

}
float smem_diff_n_dilate_entry(int *output,int  *input1,int  *input2,unsigned int widthY, unsigned int heightY , unsigned int pitch )
{
   
    //Specify a reasonable block size
    const dim3 block(32,32);
 
    //Calculate grid size to cover the whole image
    const dim3 grid(((pitch + block.x - 1)/block.x)>>2, (heightY + block.y - 1)/block.y);
 
    //Launch the diff + dilate kernel
    smem_diff_n_dilate<<<grid,block>>>(output,input1,input2,pitch,heightY,pitch,pitch);

    return 0;
}

My kernel works when I do not examine the neighboring pixel:

output[globalIdx_n]       = sharedMem[thread_idx] ;

However, it fails (bus error) with:

output[globalIdx_n]       = sharedMem[thread_idx] | sharedMem[idx_top_left];

Please help.

Hi,

Happy New Year!

Usually, a bus error is caused by an invalid concurrent read/write.
Please check your implementation for any missing synchronization.
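
For reference, a minimal sketch of the write / barrier / read pattern meant here (tile_pattern, BLOCK_W, BLOCK_H are placeholder names, and the image size is assumed to be an exact multiple of the 32x32 block):

#define BLOCK_W 32
#define BLOCK_H 32

__global__ void tile_pattern(const int *in, int *out, int width)
{
    __shared__ int tile[BLOCK_H][BLOCK_W];
    const int gx = blockIdx.x * BLOCK_W + threadIdx.x;
    const int gy = blockIdx.y * BLOCK_H + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[gy * width + gx];  // 1. each thread writes its own slot
    __syncthreads();                                       // 2. every thread of the block reaches the barrier
    int left = (threadIdx.x > 0)                           // 3. only then read another thread's slot
             ? tile[threadIdx.y][threadIdx.x - 1] : 0;
    out[gy * width + gx] = left;
}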

Thanks.

Hi AastaLLL
Happy New Year to you and the Jetson fraternity!

I am unable to spot the synchronization issue. Is there an example which can be adapted to do convolution/erosion/dilation on live camera images (using Argus/EGL)?

Thanks,

Hi,

For image processing examples, please check our CUDA samples or the VisionWorks library.

It’s recommended to double-check the thread-index calculation.
A common issue is that the program computes the same index value for different threads under some condition.
Concurrent access of output[globalIdx_n] or sharedMem[thread_idx] can also lead to a bus error.
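
One quick way to pin down the faulting access is cuda-memcheck (the binary name below is just a placeholder):

# reports out-of-bounds global/shared reads and writes with the offending kernel and thread
cuda-memcheck ./my_app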

Thanks

Hi AastaLLL,

Thanks for your help, and I appreciate your intuition. Indeed, it was an issue of indices going negative at the block boundaries.

Replacing

int idx_top_left          = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1);

with

int idx_top_left          = (threadIdx.y-1) * blockDim.x + (threadIdx.x-1);
if (idx_top_left < 0) idx_top_left = 0;

fixes the issue.
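
For the record, the original boundary checks could never fire because threadIdx.x/threadIdx.y are unsigned, so (threadIdx.y-1) < 0 is always false. A sketch of the same guard done with signed indices (an alternative to the clamp above, same 32x32 block assumed):

// Alternative sketch: cast to signed int before subtracting so the edge test
// itself fires, instead of clamping idx_top_left afterwards.
const int tx = (int)threadIdx.x;
const int ty = (int)threadIdx.y;
if (tx - 1 < 0 || ty - 1 < 0 ||
    tx + 1 >= (int)blockDim.x || ty + 1 >= (int)blockDim.y) return;

int idx_top_left = (ty - 1) * (int)blockDim.x + (tx - 1);   // never negative here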

Next I am looking to optimize the global (DDR) memory fetches from the frame buffers. I want to make sure that the frame reads are optimally ‘coalesced’. Do you think that reading the 8-bit frame pixels as ints (groups of 4 pixels) is the most optimal way?

p1 = input1[globalIdx_n]; // input1 == int pointer to frame buffer 1
p2 = input2[globalIdx_n]; // input2 == int pointer to frame buffer 2

The two lines above would issue a word fetch to DDR at two different (discontiguous) addresses. Would it help if I fetched and processed 8 or 16 pixels from each frame at the same time?
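
For example, something along these lines (globalIdx4 is a hypothetical index in units of int4, and the frame pointers and pitch are assumed to be 16-byte aligned):

const int4 *in1 = reinterpret_cast<const int4 *>(input1);
const int4 *in2 = reinterpret_cast<const int4 *>(input2);
int4 q1 = in1[globalIdx4];   // one 16-byte load == 16 packed 8-bit pixels of frame 1
int4 q2 = in2[globalIdx4];   // one 16-byte load == 16 packed 8-bit pixels of frame 2
// q1.x..q1.w and q2.x..q2.w each still hold 4 pixels, to be differenced bytewise as before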

Thanks

Hi,

Sorry for the late reply.

Coalescing depends on many factors, e.g. buffer size, kernel size, and hardware resources.
It’s recommended to monitor GPU behavior via nvprof to get more information.
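
For example (the binary name is a placeholder), the load/store efficiency metrics show how well the fetches coalesce:

nvprof --metrics gld_efficiency,gst_efficiency,shared_load_transactions ./frame_diff_app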

Thanks.