I am struggling to implement something that seems very basic.

I need to 3D-convolve a 4D array of shape (32x128x128x128) with a kernel `k` of shape (32,32,6,6,6). However, my convolutional kernel is spatially separable (but not separable over the channel dimension), in the sense that `k[c_in,c_out,i,j,k] = k_x[c_in, c_out,i] * k_y[c_in, c_out,j] * k_z[c_in, c_out,k]`.

In principle, the fact that the kernel is separable should lead to a significant speed up since:

`L[c_out, x,y,z] = \sum_{c_in} \sum_{i} \sum_{j} \sum_{k} M[c_in, x+i, y+j, z+k] * k_x[c_in, c_out,i] * k_y[c_in, c_out,j] * k_z[c_in, c_out,k]`

This entails that:

`L[c_out, x,y,z] = \sum_{c_in} \sum_{i} k_x[c_in, c_out,i] \sum_{j} k_y[c_in, c_out,j] \sum_{k} M[c_in, x+i, y+j, z+k] k_z[c_in, c_out,k]`

`L[c_out, x,y,z] = \sum_{c_in} \sum_{i} k_x[c_in, c_out,i] \sum_{j} k_y[c_in, c_out,j] Lz[c_out, c_in, x+i, y+j, z]`

`L[c_out, x,y,z] = \sum_{c_in} \sum_{i} k_x[c_in, c_out,i] Ly[c_out, c_in, x+i, y, z]`

with

`Lz[c_out, c_in, x+i, y+j, z] = \sum_{k} M[c_in, x+i, y+j, z+k] k_z[c_in, c_out,k]`

`Ly[c_out, c_in, x+i, y, z] = \sum_{j} k_y[c_in, c_out,j] Lz[c_out, c_in, x+i, y+j, z]`
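To make the factorization concrete, here is a small CPU-side NumPy sketch that evaluates both the direct sum and the three one-dimensional passes and compares them. The sizes are tiny stand-ins for 32, 32, 128 and 6, the kernel index is called `l` to avoid clashing with the kernel name, and borders are handled in "valid" mode; all of these are assumptions for illustration, not part of the original setup. For a 6x6x6 kernel the direct form needs 216 multiply-adds per channel pair and output voxel, while the three passes need roughly 6 + 6 + 6 = 18.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
C_in, C_out, N, K = 3, 4, 10, 3  # small stand-ins for 32, 32, 128, 6 (assumption)
M = rng.standard_normal((C_in, N, N, N))
k_x = rng.standard_normal((C_in, C_out, K))
k_y = rng.standard_normal((C_in, C_out, K))
k_z = rng.standard_normal((C_in, C_out, K))

# Direct form: W[c, x, y, z, i, j, l] = M[c, x+i, y+j, z+l]
W = sliding_window_view(M, (K, K, K), axis=(1, 2, 3))
k_full = np.einsum("coi,coj,col->coijl", k_x, k_y, k_z)
L_direct = np.einsum("cxyzijl,coijl->oxyz", W, k_full)

# Separable form, one spatial axis at a time.
# Lz[o, c, x, y, z] = sum_l M[c, x, y, z+l] * k_z[c, o, l]
Lz = np.einsum("cxyzl,col->ocxyz", sliding_window_view(M, K, axis=3), k_z)
# Ly[o, c, x, y, z] = sum_j k_y[c, o, j] * Lz[o, c, x, y+j, z]
Ly = np.einsum("ocxyzj,coj->ocxyz", sliding_window_view(Lz, K, axis=3), k_y)
# L[o, x, y, z] = sum_c sum_i k_x[c, o, i] * Ly[o, c, x+i, y, z]
L_sep = np.einsum("ocxyzi,coi->oxyz", sliding_window_view(Ly, K, axis=2), k_x)

max_abs_diff = np.abs(L_direct - L_sep).max()  # should be at rounding-error level
```

Note that `Lz` and `Ly` here carry both channel axes, which is exactly the intermediate that becomes too large at full size.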

I cannot construct `Ly` in memory because it would be too large (32x32x128x128x128). In order to minimize global memory reads, I am convolving a single input channel `M[c_in = i]` with `k_z` and `k_y` and then adding the result to `L` at the appropriate locations, but this leads to a significant overhead in global memory writes.
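The per-input-channel accumulation described here can be sketched as follows. This is a NumPy illustration of the data flow only, not a GPU kernel, and it says nothing about the global memory traffic that is the actual bottleneck. The function name `separable_conv3d_accumulate` is hypothetical, "valid" borders are assumed, and the `k_x` pass is assumed to happen just before the accumulation into `L`. The point is that the temporaries have shape (C_out, volume) rather than (C_out, C_in, volume).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def separable_conv3d_accumulate(M, k_x, k_y, k_z):
    """Hypothetical helper: separable 3D correlation, one input channel at a time."""
    C_in = M.shape[0]
    C_out, K = k_x.shape[1], k_x.shape[2]
    out_shape = tuple(n - K + 1 for n in M.shape[1:])
    L = np.zeros((C_out, *out_shape), dtype=M.dtype)
    for c in range(C_in):
        # z pass: Lz[o, x, y, z] = sum_l M[c, x, y, z+l] * k_z[c, o, l]
        Lz = np.einsum("xyzl,ol->oxyz", sliding_window_view(M[c], K, axis=2), k_z[c])
        # y pass: Ly[o, x, y, z] = sum_j k_y[c, o, j] * Lz[o, x, y+j, z]
        Ly = np.einsum("oxyzj,oj->oxyz", sliding_window_view(Lz, K, axis=2), k_y[c])
        # x pass, accumulated into the output over input channels
        L += np.einsum("oxyzi,oi->oxyz", sliding_window_view(Ly, K, axis=1), k_x[c])
    return L


# Small check against the direct (non-separable) sum; sizes are assumptions.
rng = np.random.default_rng(1)
C_in, C_out, N, K = 3, 4, 9, 3
M = rng.standard_normal((C_in, N, N, N))
k_x, k_y, k_z = (rng.standard_normal((C_in, C_out, K)) for _ in range(3))

L = separable_conv3d_accumulate(M, k_x, k_y, k_z)
k_full = np.einsum("coi,coj,col->coijl", k_x, k_y, k_z)
W = sliding_window_view(M, (K, K, K), axis=(1, 2, 3))
L_ref = np.einsum("cxyzijl,coijl->oxyz", W, k_full)
max_abs_diff = np.abs(L - L_ref).max()  # should be at rounding-error level
```

Each loop iteration still performs a read-modify-write of the whole of `L`, which is the write overhead in question.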

This seems so basic that I believe (or at least hope) it must already have been implemented somewhere. Has anyone achieved this?