Memory optimal approach to Z dimension in separable 3D convolution.

CudaaduC · July 12, 2015, 5:15am

I am running a standard discrete separable 3D Gaussian convolution on a 768x768x32 input set, but the convolution along the Z dimension takes about 4x as long as the one along the X or Y dimension.
While this is somewhat to be expected, I wonder what coalesced memory approach could be applied.
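To make the stride problem concrete, here is a minimal host-side C++ sketch of the index arithmetic. It assumes X is the fastest-varying axis in a single contiguous buffer, which the thread does not confirm; `VolumeDims`, `flatIndex` and the `stride*` helpers are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout (not confirmed in the thread): x varies fastest, then y, then z.
struct VolumeDims {
    std::size_t nx, ny, nz;
};

// Hypothetical helper: flat offset of voxel (x, y, z) in one contiguous buffer.
inline std::size_t flatIndex(const VolumeDims& d,
                             std::size_t x, std::size_t y, std::size_t z) {
    assert(x < d.nx && y < d.ny && z < d.nz);
    return x + d.nx * (y + d.ny * z);
}

// Distance, in elements, between neighbouring voxels along each axis.
inline std::size_t strideX(const VolumeDims&)   { return 1; }
inline std::size_t strideY(const VolumeDims& d) { return d.nx; }
inline std::size_t strideZ(const VolumeDims& d) { return d.nx * d.ny; }
```

For 768x768x32, `strideZ` is 768 * 768 = 589,824 elements, so with 4-byte floats two Z neighbours sit 2,359,296 bytes apart. Under this layout a single thread walking along Z never touches adjacent addresses; whether the accesses of a warp coalesce then depends on how neighbouring threads are mapped to X.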

I have already gone through the related code in the CUDA SDK and looked around a bit via Google, checking out the first results, like these:

[url]Browse code samples | Microsoft Docs

[url]http://www-igm.univ-mlv.fr/~biri/Enseignement/MII2/Donnees/convolutionSeparable.pdf[/url]

Most published work focuses on the 2D case, but it turns out that the third dimension (assuming contiguous column-major format) is the bottleneck.

With my current shared memory approach the running time is decent, but I suspect there is significant room for improvement in the third (Z) dimension kernel.

Any ideas/links/papers/code to that specific topic?

Clochette · July 12, 2015, 6:09am

AFAIK CUDA 3D textures already use a space-filling curve approach under the hood, so spatially close points in 3D will more than likely be mapped to array indices that are close, but I could be wrong.

If the accuracy isn’t a concern and/or your dataset has smooth structures, you can look into approximate bilateral filtering (which, when operating on 2D grayscale images, IS 3D filtering)/high-dimensional filtering approaches. One such example is http://inf.ufrgs.br/~eslgastal/AdaptiveManifolds/Gastal_Oliveira_SIGGRAPH2012_Adaptive_Manifolds.pdf

CudaaduC · July 12, 2015, 9:18pm

Janet,

Thanks, but in my situation I need 32-bit accuracy and have to implement this filter so that it returns exactly the same result as MATLAB convn().

What is interesting is that the number and placement of __syncthreads() around the global memory reads and writes is having a massive impact on running time. This seems to be because the number and placement of __syncthreads() impacts the number of registers used, which impacts occupancy.

Even though I do not need the __syncthreads() at those locations (in the typical sense), adding them improves register utilization and performance by as much as 50%.

Oh, and Janet please wait until 2017 to raise interest rates, because my Granny owns Government bonds and needs to preserve her buying power. The TIPS to 10 year spread (breakeven yield) seems to be moving up, but is it enough to justify a rate increase?

[url]http://www.nasdaq.com/symbol/rinf/stock-chart[/url]

Clochette · July 12, 2015, 10:06pm

What if you do the convolution on X and Y dimension, transpose the data, then do the convolution on Z dimension?

I suspect the __syncthreads() is stopping the compiler from “caching” global memory in registers. Do you see the same if you declare the global memory variable volatile?
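The transpose suggestion above can be checked on the CPU before any kernel is written. The sketch below is plain C++ rather than CUDA, and every name in it is hypothetical. It assumes an X-fastest layout, an odd-length kernel centred on the output voxel, and zero padding outside the volume; none of these is taken from the thread or from convn(). It compares a direct strided Z pass against transpose, convolve along the contiguous axis, transpose back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Volume = std::vector<float>;

// Transpose a row-major rows x cols matrix into a cols x rows matrix.
Volume transpose2D(const Volume& src, std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    Volume dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}

// Convolve every row along its contiguous axis (zero padding, odd kernel).
Volume convolveRows(const Volume& in, std::size_t rows, std::size_t cols,
                    const std::vector<float>& k) {
    const long long rad = static_cast<long long>(k.size() / 2);
    Volume out(in.size(), 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (long long j = -rad; j <= rad; ++j) {
                const long long cc = static_cast<long long>(c) - j;
                if (cc >= 0 && cc < static_cast<long long>(cols))
                    acc += k[static_cast<std::size_t>(j + rad)] *
                           in[r * cols + static_cast<std::size_t>(cc)];
            }
            out[r * cols + c] = acc;
        }
    return out;
}

// Direct Z pass: neighbouring taps are nx * ny elements apart.
Volume convolveZDirect(const Volume& in, std::size_t nx, std::size_t ny,
                       std::size_t nz, const std::vector<float>& k) {
    const std::size_t plane = nx * ny;
    const long long rad = static_cast<long long>(k.size() / 2);
    Volume out(in.size(), 0.0f);
    for (std::size_t z = 0; z < nz; ++z)
        for (std::size_t p = 0; p < plane; ++p) {
            float acc = 0.0f;
            for (long long j = -rad; j <= rad; ++j) {
                const long long zz = static_cast<long long>(z) - j;
                if (zz >= 0 && zz < static_cast<long long>(nz))
                    acc += k[static_cast<std::size_t>(j + rad)] *
                           in[static_cast<std::size_t>(zz) * plane + p];
            }
            out[z * plane + p] = acc;
        }
    return out;
}

// Transpose so Z becomes contiguous, convolve, then transpose back.
Volume convolveZViaTranspose(const Volume& in, std::size_t nx, std::size_t ny,
                             std::size_t nz, const std::vector<float>& k) {
    const std::size_t plane = nx * ny;
    const Volume zFastest = transpose2D(in, nz, plane);         // plane x nz
    const Volume filtered = convolveRows(zFastest, plane, nz, k);
    return transpose2D(filtered, plane, nz);                     // nz x plane
}
```

Both paths accumulate the kernel taps in the same order, so they should agree on a small test volume. On the GPU the trade-off is two extra transpose passes over the data in exchange for contiguous reads in the Z pass, so whether it wins would need to be measured.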

CudaaduC · July 12, 2015, 10:24pm

That is a great idea which I had not yet thought of, and it will probably work.
Thanks.
