Spatially separable 3D convolution

helange · September 24, 2021, 10:59pm

I am struggling to implement something that seems very basic.

I need to convolve (in 3D) a 4D matrix M of shape (32x128x128x128) with a kernel k of shape (32,32,6,6,6). However, my convolutional kernel is separable spatially (but not over the channel dimension), in the sense that k[c_in,c_out,i,j,k] = k_x[c_in, c_out,i] * k_y[c_in, c_out,j] * k_z[c_in, c_out,k].

In principle, the fact that the kernel is separable should lead to a significant speedup, since:
L[c_out, x,y,z] = \sum_{c_in} \sum_{i} \sum_{j} \sum_{k} M[c_in, x+i, y+j, z+k] * k_x[c_in, c_out,i] * k_y[c_in, c_out,j] * k_z[c_in, c_out,k]

This entails that:
L[c_out, x,y,z] = \sum_{c_in} \sum_{i} k_x[c_in, c_out,i] \sum_{j} k_y[c_in, c_out,j] \sum_{k} M[c_in, x+i, y+j, z+k] k_z[c_in, c_out,k]

L[c_out, x,y,z] = \sum_{c_in} \sum_{i} k_x[c_in, c_out,i] \sum_{j} k_y[c_in, c_out,j] Lz[c_out, c_in, x+i, y+j, z]

L[c_out, x,y,z] = \sum_{c_in} \sum_{i} k_x[c_in, c_out,i] Ly[c_out, c_in, x+i, y, z]

with

Lz[c_out, c_in, x+i, y+j, z] = \sum_{k} M[c_in, x+i, y+j, z+k] k_z[c_in, c_out,k]
Ly[c_out, c_in, x+i, y, z] = \sum_{j} k_y[c_in, c_out,j] Lz[c_out, c_in, x+i, y+j, z]
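
The factorization above can be checked numerically on a small problem. What follows is a minimal NumPy sketch on the CPU, not a CUDA implementation. The function names (`direct_conv`, `separable_conv`, `macs_per_output_voxel`) are made up for illustration, and it assumes the "valid" region only (no padding), which the formulas above leave unspecified.

```python
import numpy as np

def direct_conv(M, kx, ky, kz):
    """Direct evaluation with the full 5D kernel (valid region only)."""
    C_in, X, Y, Z = M.shape
    _, C_out, K = kx.shape
    ox, oy, oz = X - K + 1, Y - K + 1, Z - K + 1
    # k_full[c_in, c_out, i, j, k] = kx[c_in, c_out, i] * ky[c_in, c_out, j] * kz[c_in, c_out, k]
    k_full = np.einsum('abi,abj,abk->abijk', kx, ky, kz)
    out = np.zeros((C_out, ox, oy, oz))
    for i in range(K):
        for j in range(K):
            for k in range(K):
                patch = M[:, i:i + ox, j:j + oy, k:k + oz]
                # Sum over c_in for this spatial tap.
                out += np.einsum('ab,axyz->bxyz', k_full[:, :, i, j, k], patch)
    return out

def separable_conv(M, kx, ky, kz):
    """Three 1D passes (z, then y, then x); the channel sum happens in the x pass."""
    C_in, X, Y, Z = M.shape
    _, C_out, K = kx.shape
    ox, oy, oz = X - K + 1, Y - K + 1, Z - K + 1
    # Lz[c_out, c_in, x, y, z] = sum_k M[c_in, x, y, z+k] * kz[c_in, c_out, k]
    Lz = sum(np.einsum('ab,axyz->baxyz', kz[:, :, t], M[:, :, :, t:t + oz])
             for t in range(K))
    # Ly[c_out, c_in, x, y, z] = sum_j ky[c_in, c_out, j] * Lz[c_out, c_in, x, y+j, z]
    Ly = sum(np.einsum('ab,baxyz->baxyz', ky[:, :, t], Lz[:, :, :, t:t + oy, :])
             for t in range(K))
    # L[c_out, x, y, z] = sum_{c_in} sum_i kx[c_in, c_out, i] * Ly[c_out, c_in, x+i, y, z]
    return sum(np.einsum('ab,baxyz->bxyz', kx[:, :, t], Ly[:, :, t:t + ox, :, :])
               for t in range(K))

def macs_per_output_voxel(c_in, c_out, K):
    """Rough multiply-accumulate counts per output voxel, ignoring boundaries."""
    direct = c_in * c_out * K ** 3      # every tap of the full kernel
    separable = c_in * c_out * 3 * K    # K taps in each of the three passes
    return direct, separable
```

For the sizes in this post, `macs_per_output_voxel(32, 32, 6)` gives 221184 versus 18432, a factor of K^2/3 = 12. Note that `separable_conv` materializes the full Lz and Ly arrays, so it is only usable for small test sizes.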

I cannot construct Ly in memory because it would be too large (32x32x128x128x128). To minimize global memory reads, I convolve a single input channel of M (one fixed c_in at a time) with k_z and k_y and then add the result to L at the appropriate locations, but this leads to significant overhead in global memory writes.

This seems so basic that I believe (or at least hope) it must have been implemented somewhere. Has anyone achieved this?
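
To make the memory argument and the one-channel-at-a-time strategy concrete, here is a small NumPy reference sketch. It is a CPU illustration under assumptions, not the CUDA kernel described above: the names `footprint_bytes` and `separable_conv_streamed` are hypothetical, float32 storage (4 bytes per element) is assumed for the size estimate, and only the valid region is computed.

```python
import math
import numpy as np

def footprint_bytes(shape, itemsize=4):
    """Bytes needed to store an array of this shape (float32 assumed by default)."""
    return math.prod(shape) * itemsize

def separable_conv_streamed(M, kx, ky, kz):
    """Process one input channel at a time and accumulate into L.

    Only the intermediates for a single c_in exist at any moment, so the
    full (c_out, c_in, x, y, z) array Ly is never materialized.
    """
    C_in, X, Y, Z = M.shape
    _, C_out, K = kx.shape
    ox, oy, oz = X - K + 1, Y - K + 1, Z - K + 1
    L = np.zeros((C_out, ox, oy, oz))
    for a in range(C_in):
        # z pass for this input channel: shape (C_out, X, Y, oz)
        Lz = sum(kz[a, :, t][:, None, None, None] * M[a, :, :, t:t + oz]
                 for t in range(K))
        # y pass: shape (C_out, X, oy, oz)
        Ly = sum(ky[a, :, t][:, None, None, None] * Lz[:, :, t:t + oy, :]
                 for t in range(K))
        # x pass, accumulated into L; on a GPU this is the read-modify-write on L
        L += sum(kx[a, :, t][:, None, None, None] * Ly[:, t:t + ox, :, :]
                 for t in range(K))
    return L
```

Under the float32 assumption, `footprint_bytes((32, 32, 128, 128, 128))` is 8 GiB for the full Ly, while the per-channel intermediate `footprint_bytes((32, 128, 128, 128))` is 256 MiB. The loop still updates L once per input channel, so it reproduces the write overhead described above rather than removing it.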

Robert_Crovella · September 25, 2021, 2:26am

cuDNN can do 3D convolutions on a 4D tensor; however, I wouldn’t be able to give you a roadmap, and I’m not saying it takes into account the spatially separable character of the kernel, which seems to be the crux of your question.

There is a separate forum for cuDNN in case you are interested.

A simple Google search turns up items like this, and there is a separable convolution CUDA sample code, but it’s not designed for the dimensionality you describe.
