Can anyone point me to examples of non-separable convolution? I'm especially interested in larger kernel sizes.
For larger kernels especially, you'll want to do the convolution in the frequency domain. There's an example of this in the SDK that uses the CUFFT library.
The CUFFT documentation also includes simple examples of how to do FFTs in 1, 2 or 3 dimensions.
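To make the frequency-domain idea concrete, here is a minimal CPU sketch in C++. It is not the SDK example and does not call CUFFT: a naive O(n^2) DFT stands in for the FFT, and it works in 1D to stay short. The function names (dft, frequency_convolve, direct_convolve) are made up for illustration. The steps are the same ones an FFT-based version would take: zero-pad both inputs to the full output length, transform, multiply pointwise, transform back.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT (a stand-in for a real FFT library).
// sign = -1 gives the forward transform, sign = +1 the unscaled inverse.
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi * double((k * t) % n) / double(n);
            sum += x[t] * cd(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Linear convolution via the frequency domain.
std::vector<double> frequency_convolve(const std::vector<double>& signal,
                                       const std::vector<double>& kernel) {
    assert(!signal.empty() && !kernel.empty());
    // Zero-pad to the full output length so circular wrap-around cannot occur.
    const std::size_t n = signal.size() + kernel.size() - 1;
    std::vector<cd> a(n, cd(0.0, 0.0)), b(n, cd(0.0, 0.0));
    for (std::size_t i = 0; i < signal.size(); ++i) a[i] = signal[i];
    for (std::size_t i = 0; i < kernel.size(); ++i) b[i] = kernel[i];

    std::vector<cd> A = dft(a, -1);
    const std::vector<cd> B = dft(b, -1);
    for (std::size_t i = 0; i < n; ++i) A[i] *= B[i];  // pointwise product

    const std::vector<cd> c = dft(A, +1);
    std::vector<double> result(n);
    for (std::size_t i = 0; i < n; ++i) result[i] = c[i].real() / double(n);
    return result;
}

// Direct (spatial-domain) reference for comparison.
std::vector<double> direct_convolve(const std::vector<double>& signal,
                                    const std::vector<double>& kernel) {
    assert(!signal.empty() && !kernel.empty());
    std::vector<double> out(signal.size() + kernel.size() - 1, 0.0);
    for (std::size_t i = 0; i < signal.size(); ++i)
        for (std::size_t j = 0; j < kernel.size(); ++j)
            out[i + j] += signal[i] * kernel[j];
    return out;
}
```

Both functions should agree to within rounding error on the same inputs. With a real FFT the transform cost grows as n log n and, apart from the padding, does not depend on the kernel size, which is why this route tends to pay off as kernels get larger.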
Except that I want to assess the tradeoff between spatial convolution, with multiple parallel threads working on different portions of the image, and the frequency-domain approach you mention.
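As a baseline for that comparison, here is a rough CPU reference in C++ for direct, non-separable 2D convolution. It is only a sketch under a few assumptions: the image is row-major, the kernel is square with an odd side length, and pixels outside the image count as zero. The names convolve2d_direct and direct_macs are hypothetical. Each output pixel depends only on the input, never on other outputs, so the two outer loops are the part that could be handed out to parallel threads, for example one thread per output pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 2D convolution of a row-major width x height image with a
// ksize x ksize kernel (ksize odd). Out-of-range pixels are treated as zero.
std::vector<float> convolve2d_direct(const std::vector<float>& image,
                                     int width, int height,
                                     const std::vector<float>& kernel,
                                     int ksize) {
    assert(width > 0 && height > 0);
    assert(ksize > 0 && ksize % 2 == 1);
    assert(image.size() == std::size_t(width) * std::size_t(height));
    assert(kernel.size() == std::size_t(ksize) * std::size_t(ksize));

    const int r = ksize / 2;
    std::vector<float> out(image.size(), 0.0f);

    // Every (x, y) iteration is independent of the others.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int ky = -r; ky <= r; ++ky) {
                for (int kx = -r; kx <= r; ++kx) {
                    // True convolution: the kernel is flipped relative to the image.
                    const int sx = x - kx;
                    const int sy = y - ky;
                    if (sx < 0 || sx >= width || sy < 0 || sy >= height) continue;
                    acc += image[std::size_t(sy) * width + sx] *
                           kernel[std::size_t(ky + r) * ksize + (kx + r)];
                }
            }
            out[std::size_t(y) * width + x] = acc;
        }
    }
    return out;
}

// Upper bound on multiply-adds for the direct method (ignores border skipping).
double direct_macs(int width, int height, int ksize) {
    return double(width) * double(height) * double(ksize) * double(ksize);
}
```

The direct cost grows with the square of the kernel side, while an FFT-based version grows roughly as P^2 log P for a padded side length P and is largely insensitive to kernel size. Where the two curves cross depends on the hardware and on how well the spatial version reuses memory, so it is something to measure rather than predict.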