# cuDNN custom convolution

Hello, I would like to take a 3D medical image and calculate the mean and standard deviation of each voxel's neighborhood. That is, I would like a kernel that operates on a cube of data centered on each voxel in the image. Can I use cuDNN to achieve this? The pseudocode would look something like this:

```
I: 900 x 900 x 900     // image data
dim = convDim = 5      // edge length of the convolution filter, to be set by trial and error

// Kernel that takes the cube of data around each voxel,
// where xb - xa = yb - ya = zb - za = dim,
// and returns a vector with the mean and standard deviation of that neighborhood.
__global__ getLocalDat(data = I(xa:xb, ya:yb, za:zb)) -> [mean, std] {
    flattened = flatten(data)   // all values from the cube covered by the filter
    return [mean(flattened), std(flattened)]
}
```

Of course, I omitted all memory allocations, resolving bank conflicts (since the same data has to be accessed by multiple kernels at once), uploading data to the shared memory of each block, and so on. I assume those tasks are already implemented, since this is a common problem for all convolutions. Am I right?

I would be very grateful for any help with this problem.

The answer is no: cuDNN has no precompiled mean or standard deviation kernel.
However, you can try to get the same result with a convolution followed by a sequence of other operations.

Thanks!


Thanks! OK, and is there a state-of-the-art 3D CUDA convolution example with an explanation? I have found 2D examples, but 3D indexing remains tricky.
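On the 3D indexing point, a minimal host-side C++ sketch of the usual flat-index arithmetic may help. It assumes the volume is stored as one contiguous array with x as the fastest-varying axis; `Dim3D`, `Index3D`, `flatIndex`, and `unflatIndex` are hypothetical names for illustration, not part of CUDA or cuDNN.

```cpp
#include <cstddef>

// Hypothetical helper types for illustration; not a CUDA or cuDNN API.
struct Dim3D { std::size_t nx, ny, nz; };
struct Index3D { std::size_t x, y, z; };

// Map (x, y, z) to a position in a contiguous array,
// assuming x varies fastest, then y, then z.
inline std::size_t flatIndex(const Dim3D& d, std::size_t x, std::size_t y, std::size_t z) {
    return (z * d.ny + y) * d.nx + x;
}

// Inverse mapping: recover (x, y, z) from a flat position.
inline Index3D unflatIndex(const Dim3D& d, std::size_t i) {
    Index3D p;
    p.x = i % d.nx;
    p.y = (i / d.nx) % d.ny;
    p.z = i / (d.nx * d.ny);
    return p;
}
```

Inside a CUDA kernel the same expression applies, with each coordinate typically computed as `blockIdx.x * blockDim.x + threadIdx.x` (and likewise for y and z), followed by a bounds check against the volume size.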

Also, I am wondering whether my idea is correct and how to achieve some of its points:

1. Divide the image into 3D blocks with some excess: padding equal to half the edge length of the kernel. Push those cubes of data into the shared memory of each block. I suppose bigger blocks, with the maximum number of threads, may be better, as I would waste less memory on the padding.
2. Iterate over the data in a block so that on each pass every thread analyzes only non-overlapping parts of the data, in order to prevent bank conflicts. This requires as many passes as there are voxels in the kernel. To keep it safe from bank conflicts, I planned to sync the threads in the block after each pass. Does that make sense?
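The padding trade-off in point 1 can be estimated before writing any kernel. The sketch below is host-side C++ and assumes a cubic block of output voxels, one thread per output voxel, and an odd kernel edge; `TileCost` and `tileCost` are hypothetical names for illustration.

```cpp
#include <cstddef>

// Hypothetical helper for illustration; not a CUDA or cuDNN API.
struct TileCost {
    std::size_t interior;  // voxels that get an output (one per thread)
    std::size_t withHalo;  // voxels that must be staged, padding included
    double overhead;       // withHalo / interior
};

// tileEdge: edge length of the cubic block of output voxels.
// kernelEdge: edge length of the cubic neighborhood; assumed odd.
inline TileCost tileCost(std::size_t tileEdge, std::size_t kernelEdge) {
    const std::size_t halo = kernelEdge / 2;         // padding on each side
    const std::size_t padded = tileEdge + 2 * halo;  // staged edge length
    TileCost c;
    c.interior = tileEdge * tileEdge * tileEdge;
    c.withHalo = padded * padded * padded;
    c.overhead = static_cast<double>(c.withHalo) / static_cast<double>(c.interior);
    return c;
}
```

For a 5-wide kernel, an 8x8x8 block stages 12x12x12 = 1728 voxels for 512 outputs (3.375 times), while a 10x10x10 block stages 14x14x14 = 2744 voxels for 1000 outputs (about 2.74 times), so larger blocks do waste proportionally less on padding. As for point 2, bank conflicts affect speed rather than correctness, and reading shared memory from many threads is not a race, so a block-level sync is normally needed after the tile is loaded rather than after every pass.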

There is no existing kernel that does this. Depending on your performance requirements, the approach below may or may not work for you:

1. If you do a single-channel convolution with an NxNxN filter in which every element is 1/N^3, the result is the mean of the NxNxN neighborhood.

2. Similarly, you can launch a pointwise multiply to square the input tensor, then take the neighborhood mean of the squares with the same convolution as above.

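3. Finally, you can combine the two results to find the standard deviation: std = sqrt(E[x^2] - (E[x])^2), where E[x] is the neighborhood mean from step 1 and E[x^2] is the neighborhood mean of squares from step 2.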

see Standard deviation - Wikipedia
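To check a GPU result against something simple, the three steps can be written as a CPU reference for a single voxel. This is a sketch under stated assumptions, not a cuDNN implementation: it assumes an odd N, a cube that lies fully inside the volume, x as the fastest-varying axis, and the population standard deviation (dividing by N^3). `MeanStd` and `localMeanStd` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration.
struct MeanStd { double mean; double stddev; };

// Mean and population standard deviation of the n x n x n cube centered at
// (cx, cy, cz). Assumes n is odd and the cube lies fully inside the volume.
inline MeanStd localMeanStd(const std::vector<float>& img,
                            std::size_t nx, std::size_t ny,
                            std::size_t cx, std::size_t cy, std::size_t cz,
                            std::size_t n) {
    const std::size_t r = n / 2;
    const double w = 1.0 / static_cast<double>(n * n * n);  // filter weight 1/N^3
    double mean = 0.0, meanSq = 0.0;
    for (std::size_t z = cz - r; z <= cz + r; ++z)
        for (std::size_t y = cy - r; y <= cy + r; ++y)
            for (std::size_t x = cx - r; x <= cx + r; ++x) {
                const double v = img[(z * ny + y) * nx + x];
                mean += w * v;        // step 1: box-filter mean
                meanSq += w * v * v;  // step 2: box-filter mean of squares
            }
    double var = meanSq - mean * mean;  // step 3: E[x^2] - (E[x])^2
    if (var < 0.0) var = 0.0;           // guard against rounding below zero
    return MeanStd{mean, std::sqrt(var)};
}
```

Note that E[x^2] - (E[x])^2 subtracts two nearly equal numbers when the neighborhood is almost constant, so in single precision on the GPU the result may be noisy for large intensity values; accumulating in a wider type or subtracting a rough global mean first are common ways to reduce that.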

Thank you!
