White output in Sobel Filter

Hi I am trying to implement a sobel filter using Shared memory, but am getting a black output image. I have tried tweaking the code but it still doesnt give the output. Following is the code snippet:

#define tilewidth 14;
#define blockwidth 16;

global void d_SobelShared(unsigned char org,unsigned char result, int width, int height, const int restrict Gx, const int restrict Gy ){

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int maskRadius=maskWidth/2;
    __shared__ char img_ds[blockwidth][blockwidth];
    int rowO = blockIdx.y*tilewidth+ty;
    int colO = blockIdx.x*tilewidth+tx;
    int rowI = rowO-maskRadius;
    int colI = colO-maskRadius;

    if( (rowI>=0) && (colI>=0) && (rowI<height) && (colI<width))
            img_ds[ty][tx] = org[rowI*width+colI];
            img_ds[ty][tx] = 0;

    float opX=0,opY=0;
    int sum=0,i=0,j=0;
    if(ty<tilewidth && tx<tilewidth)
            for( i=0;i<maskWidth;i++){
                    for( j=0;j<maskWidth;j++){
                            opX=opX+img_ds[i+ty][j+tx]*Gx [(i+maskRadius)*maskWidth+(j+maskRadius)];
                            opY=opY+img_ds[i+ty][j+tx]*Gy [(i+maskRadius)*maskWidth+(j+maskRadius)];
    sum = abs(opX) + abs(opY) ;
       if ( sum > 255) sum = 255;
            if ( sum < 0) sum = 0;
if(tx<tilewidth && ty<tilewidth && rowO<height && colO<width)


Any help is greatly appreciated…

i am not sure whether i agree with, + tx; + ty in:

int rowO = blockIdx.ytilewidth+ty;
int colO = blockIdx.x

so, with the following:


  • width might push/ land you outside of the image, i think

does the debugger, or something similar, complain about memory access out of bounds?
just make certain of the addressing

if you are confident about the addressing, perhaps put a breakpoint just after the immediate __syncthreads() following:

img_ds[ty][tx] = org[rowI*width+colI];

(mouse-over and) peak at img_ds to see whether you actually have real data going into your filter

then, just after the __syncthreads(), and before:


write sum to shared memory first, and peak at that too, to see whether your filter provides a valid output

I did run a cuda-memcheck and got a set of 28, Address is out of bounds, errors. Following is 1 of it:
========= Invalid global read of size 4
========= at 0x000001e0 in d_SobelShared
========= by thread (3,14,0) in block (13,0,0)
========= Address 0x503260024 is out of bounds

All the exceptions were in block(13,0,0). I am at loss in trying to find a solution.

although i have not tried it before, i recall that cuda (nsight eclipse) has remote debugging capabilities - i am sure i have run into mention of remote debugging many a time when switching to the debugger
if i am not mistaken, one can select between debugging a project as a local application, or a remote application: right-click on the program, and note: debug as…>>

if you only have control over the code, and not the debugger, i suppose you could simply look at the kernel ‘in stages’
commenting out sections of the kernel, and moving on such a code mask, should be synonymous to using a breakpoint; the only difference would be that instead of peaking at shared memory on the fly, you would ‘peak’ at your output array, ‘not on the fly’

“All the exceptions were in block(13,0,0)”

if your addresses calculated are wrong, this is a likely outcome
the former blocks manage to end within bounds even though their addresses might be wrong
the latter blocks do not have the liberty of having blocks succeeding them, and thus calculate out of bounds addresses

if the mask extends beyond a element by maskradius then:

(rowI<height) && (colI<width)

is too far to prevent out of bounds, i think

I gave it a thought and made some few tweaks to the kernel. Following is another approach to Sobel.

global void SobelFilter(unsigned char g_DataIn, unsigned char g_DataOut, int width, int height,const int restrict Gx, const int restrict Gy ){
shared unsigned char sharedMem[(blockwidth+maskRad)(blockwidth+maskRad)];
int x = blockIdx.x * tilewidth + threadIdx.x; //- maskRad;
int y = blockIdx.y * tilewidth + threadIdx.y; //- maskRad;
unsigned int index = y
unsigned int sharedIndex = threadIdx.y * blockDim.y + threadIdx.x;
if( (x-maskRad>=0) && (y-maskRad>=0) && (y-maskRad<height) && (x-maskRad<width))
sharedMem[sharedIndex] = g_DataIn[index];
float sum=0;
float sumX=0,sumY=0;
if((threadIdx.x>=maskRad) && (threadIdx.x<(blockwidth-maskRad)) && (threadIdx.y>=maskRad) && (threadIdx.y<(blockwidth-maskRad)))
//float sumX=0,sumY=0;
for(int dy=-maskRad; dy<=maskRad; ++dy)
for(int dx =-maskRad; dx <= maskRad; ++dx)
float Pixel = (float)(sharedMem[sharedIndex+(dyblockDim.x+dx)]);
sumX += Pixel
Gx [(dy+maskRad)maskWidth+(dx+maskRad)];
sumY += Pixel
Gy [(dy+maskRad)*maskWidth+(dx+maskRad)];

	sum=(abs(sumX) + abs(sumY));
   	        if(sum>255) sum=255;
    	if(sum<0) sum=0;
g_DataOut[y*width + x] =sum;


Although now i am getting the image, the edges of every block*block tile of the image is also greyed. Any thoughts on this. Thanks again for the help:) Following is the link to the output image that i am getting:

i kind of like the superimposed grid on top of the image

by the way, i believe that image itself is surely 546 years old; i remember spotting it in a machine vision textbook once; the latter dates from the year 43 bc

you generally should allow the filter to extend the block, if it still falls within the image

your read-in now is probably of the form:

if mask block == within thread block : read image data

it should probably be:

if mask block == (completely) within image : read image data

should amount to:


if ((mask_upper_corner > (0;0)) && (mask_lower_corner < (width; height)))


this is rather crude in explanation, but i think you get the picture

I couldn’t get you. The mask should be applied to tilewidth*tilewidth times of the image. I did try tweaking it but am still getting the same ooutput

no, you are getting there:

shared char img_ds[blockwidth][blockwidth]; // initial


shared unsigned char sharedMem[(blockwidth+maskRad)*(blockwidth+maskRad)]; // latest

your boundary conditions:

if((threadIdx.x>=maskRad) && (threadIdx.x<(blockwidth-maskRad)) && (threadIdx.y>=maskRad) && (threadIdx.y<(blockwidth-maskRad)))

are really for boundary blocks/ tiles - guarding that boundary blocks do not attempt to read image data that is not there to be read

the ordinary case is a block/ tile, completely surrounded by other tiles
in such a case, the block likely would not need boundary guards, as it would simply extend reading into surrounding blocks, provided that the mask radius is not greater than the tile width
now, for an ordinary block to apply the filter to its block/ tile, it generally would consume:
(block_width + (mask_radius * 2)) * (block_height + (mask_radius * 2)) data
that is, the ordinary block, plus the mask extended on all 4 sides of the block

your extension of the shared memory allocation is already an improvement, but i think it should be mask width, instead of mask radius:

shared unsigned char sharedMem[(blockwidth+maskRad)*(blockwidth+maskRad)];

the fact that the block needs to consume more data than its own size - the number of threads that it has - and given the fact that you use shared memory, means that you would need to steer your block to read in all required data
also, you need to update your boundary guards to only mind the image boundaries
in both cases, the easiest might be first calculate the absolute coordinates of the block (this you have done already), and also the absolute coordinates of the block’s mask
you would use the block mask absolute coordinates to implement the boundary guards, and to read in data

the block mask coordinates should be along the lines of:

(block starting row - mask radius; block starting column - mask radius) // mask_upper_coordinate
(block ending row + mask radius; block ending column + mask radius) // mask_lower_coordinate

your boundary guards then become:

mask_upper_coordinate :: (0;0)
mask_lower_coordinate :: (image height;image width)

clearly, only the blocks on the edges of the image would now be subject to trimming
which is significantly different to:

if((threadIdx.x>=maskRad) && (threadIdx.x<(blockwidth-maskRad)) && (threadIdx.y>=maskRad) && (threadIdx.y<(blockwidth-maskRad)))

in the above, you enforce block coordinates rather than image coordinates, i think
threadIdx.x; threadIdx.y would imply all blocks are clipped, regardless of whether they have surrounding blocks, and can thus extend into surrounding blocks
therefore the reason your graciously superimposed grid