Hello, I would like to take a 3D medical image and calculate the mean and standard deviation of each voxel's neighborhood. In other words, I would like a kernel that operates on a cube of data centered on each voxel in the image. Can I use cuDNN to achieve this? The pseudocode would look something like this:
I = 900x900x900 // image data
dim = 5 // edge length of the convolution filter; I will set it by trial and error
__global__ getLocalDat(data = I(xa:xb, ya:yb, za:zb)) -> [mean, std] { // kernel taking a cube of data around each voxel, where xb-xa = yb-ya = zb-za = dim, and returning a vector with the mean and standard deviation of the neighborhood
flattened = flatten(data) // all data from the cube covered by the convolution filter
return [mean(flattened), std(flattened)]
}
Of course, I omitted all memory allocation, resolving bank conflicts (as the same data has to be accessed for multiple kernel positions at once), uploading data to the shared memory of each block, and so on. I honestly suppose that those tasks are already implemented, as this is a common problem for all convolutions. Am I right?
I would be very grateful for any help with this problem.
The answer to this would be no, as there is no pre-compiled mean or standard deviation kernel in cuDNN.
However, you may try using a convolution combined with a sequence of other operations to get the same result.
Thanks! OK, and is there a state-of-the-art 3D CUDA convolution example with an explanation? I have found 2D examples, but 3D indexing remains tricky.
Also, I am wondering whether my idea is correct and how to achieve some of its points:
Divide the image into 3-dimensional blocks with some excess (padding equal to half the edge length of the kernel) and push those cubes of data into the shared memory of each block. I suppose bigger blocks, with the maximum number of threads, may be better, as I would waste less memory on this padding.
Iterate over the data in a block in such a way that, in a given pass, each thread only analyzes non-overlapping parts of the data, in order to prevent bank conflicts. This will require a number of passes equal to the number of voxels in a kernel. To keep it safe from bank conflicts, I planned to sync the threads in the block after each pass. Does that make sense?
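The tiling plan in the two points above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation. The names `tiledMeanStd` and `tiledLocalMeanStd`, the 8x8x8 block size, and the choice to clamp coordinates at the image border (replicate padding) are all assumptions made for the example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int R = 2;          // halo radius: half of the 5-wide window
constexpr int B = 8;          // block edge (assumed): 8*8*8 = 512 threads
constexpr int T = B + 2 * R;  // tile edge including the halo on both sides

__global__ void tiledMeanStd(const float* in, float* mean, float* sd,
                             int nx, int ny, int nz)
{
    __shared__ float tile[T][T][T];  // 12^3 floats = 6912 bytes

    // Image coordinates of the tile corner, halo included.
    const int ox = blockIdx.x * B - R;
    const int oy = blockIdx.y * B - R;
    const int oz = blockIdx.z * B - R;

    // Cooperative load: the 512 threads stride over all T^3 tile cells.
    const int tid = (threadIdx.z * B + threadIdx.y) * B + threadIdx.x;
    for (int i = tid; i < T * T * T; i += B * B * B) {
        const int tx = i % T, ty = (i / T) % T, tz = i / (T * T);
        // Clamp to the image border (replicate padding, an assumption).
        const int gx = min(max(ox + tx, 0), nx - 1);
        const int gy = min(max(oy + ty, 0), ny - 1);
        const int gz = min(max(oz + tz, 0), nz - 1);
        tile[tz][ty][tx] = in[((size_t)gz * ny + gy) * nx + gx];
    }
    __syncthreads();  // single barrier: the tile is read-only from here on

    const int x = blockIdx.x * B + threadIdx.x;
    const int y = blockIdx.y * B + threadIdx.y;
    const int z = blockIdx.z * B + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    // Accumulate the sum and the sum of squares over the 5x5x5 window.
    float s = 0.f, s2 = 0.f;
    for (int dz = 0; dz <= 2 * R; ++dz)
        for (int dy = 0; dy <= 2 * R; ++dy)
            for (int dx = 0; dx <= 2 * R; ++dx) {
                const float v =
                    tile[threadIdx.z + dz][threadIdx.y + dy][threadIdx.x + dx];
                s += v;
                s2 += v * v;
            }
    const float n = (2 * R + 1) * (2 * R + 1) * (2 * R + 1);
    const float m = s / n;
    const size_t o = ((size_t)z * ny + y) * nx + x;
    mean[o] = m;
    sd[o] = sqrtf(fmaxf(s2 / n - m * m, 0.f));  // guard against tiny negatives
}

// Hypothetical host wrapper: copies the volume to the GPU, runs the kernel,
// and copies both result volumes back.
void tiledLocalMeanStd(const std::vector<float>& img, int nx, int ny, int nz,
                       std::vector<float>& mean, std::vector<float>& sd)
{
    const size_t n = img.size(), bytes = n * sizeof(float);
    float *dIn, *dMean, *dSd;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dMean, bytes);
    cudaMalloc(&dSd, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(B, B, B);
    const dim3 grid((nx + B - 1) / B, (ny + B - 1) / B, (nz + B - 1) / B);
    tiledMeanStd<<<grid, block>>>(dIn, dMean, dSd, nx, ny, nz);

    mean.resize(n);
    sd.resize(n);
    cudaMemcpy(mean.data(), dMean, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(sd.data(), dSd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dMean);
    cudaFree(dSd);
}
```

Calling `tiledLocalMeanStd(img, nx, ny, nz, mean, sd)` on a volume whose value equals its x coordinate should give, for interior voxels, a mean equal to x and a standard deviation of sqrt(2). In this sketch only one `__syncthreads()` is used, after the tile load, because the shared tile is never written again. Threads of a warp that read the same shared-memory word are served by a broadcast rather than a bank conflict, so syncing after every pass may not be needed.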
There is no existing kernel that does this. Depending on your performance requirements, something like the following may or may not work for you:
If you do a single-channel convolution with an NxNxN filter in which every element is filled with the value 1/N^3, the result will basically be the NxNxN neighborhood mean.
Similarly, you can launch a pointwise multiply to get the square of the input tensor, then get the neighborhood mean of the squares in the same way as above.
Finally, you can use the equation std = sqrt(E[x^2] - (E[x])^2) to find the standard deviation, where E[x] is the neighborhood mean and E[x^2] is the neighborhood mean of the squares.
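The recipe above can be tried out without cuDNN by writing each step as its own small kernel. The following is a rough sketch only. The names `boxMean3D`, `squareElems`, `stdFromMoments`, and `localMeanStdByBoxFilter` are hypothetical, N is assumed to be odd, the borders are zero-padded as a padded convolution would do, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Steps 1 and 3: N x N x N box filter with every weight equal to 1/N^3.
// Voxels outside the image count as zero (zero padding, an assumption).
__global__ void boxMean3D(const float* in, float* out,
                          int nx, int ny, int nz, int N)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    const int r = N / 2;
    float acc = 0.f;
    for (int dz = -r; dz <= r; ++dz)
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                const int xx = x + dx, yy = y + dy, zz = z + dz;
                if (xx >= 0 && xx < nx && yy >= 0 && yy < ny &&
                    zz >= 0 && zz < nz)
                    acc += in[((size_t)zz * ny + yy) * nx + xx];
            }
    out[((size_t)z * ny + y) * nx + x] = acc / (float)(N * N * N);
}

// Step 2: pointwise square of the input.
__global__ void squareElems(const float* in, float* out, size_t n)
{
    const size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Step 4: std = sqrt(E[x^2] - E[x]^2), written in place over E[x^2].
__global__ void stdFromMoments(const float* mean, float* meanSq, size_t n)
{
    const size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        meanSq[i] = sqrtf(fmaxf(meanSq[i] - mean[i] * mean[i], 0.f));
}

// Hypothetical host wrapper that chains the four launches.
void localMeanStdByBoxFilter(const std::vector<float>& img,
                             int nx, int ny, int nz, int N,
                             std::vector<float>& mean, std::vector<float>& sd)
{
    const size_t n = img.size(), bytes = n * sizeof(float);
    float *dIn, *dSq, *dMean, *dSd;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dSq, bytes);
    cudaMalloc(&dMean, bytes);
    cudaMalloc(&dSd, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 b3(8, 8, 8);
    const dim3 g3((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
    const unsigned b1 = 256;
    const unsigned g1 = (unsigned)((n + b1 - 1) / b1);

    boxMean3D<<<g3, b3>>>(dIn, dMean, nx, ny, nz, N);  // E[x]
    squareElems<<<g1, b1>>>(dIn, dSq, n);              // x^2
    boxMean3D<<<g3, b3>>>(dSq, dSd, nx, ny, nz, N);    // E[x^2]
    stdFromMoments<<<g1, b1>>>(dMean, dSd, n);         // std

    mean.resize(n);
    sd.resize(n);
    cudaMemcpy(mean.data(), dMean, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(sd.data(), dSd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSq);
    cudaFree(dMean);
    cudaFree(dSd);
}
```

A call such as `localMeanStdByBoxFilter(img, nx, ny, nz, 5, mean, sd)` returns both volumes on the host. Note that a 900x900x900 float volume is about 2.9 GB, and this sketch keeps four such buffers on the device at once, so the full image may need to be processed in slabs. Because of the zero padding, values near the border are pulled toward zero; only voxels at least N/2 away from every face see a complete neighborhood.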