VPI Gaussian Blur Max Kernel Size Limitation?

Hi, the VPI library looks great, but why does it have a maximum kernel size of 11x11? I am hoping to use a kernel size of 64x64, which is supported by OpenCV. Why can’t VPI go higher than 11x11, and what are the workarounds?

https://docs.nvidia.com/vpi/algo_gaussian_filter.html

I am just guessing, but having worked on CUDA code for batched small-matrix operations, the 11x11 limit looks like an implementation artifact. There are certain implementation styles conducive to high performance that only work up to 11x11 or 12x12 matrices; beyond that, one needs to choose an entirely different approach, in effect a different kernel implementation.

You could file a feature request with NVIDIA to add support for larger kernels and wait for an indeterminate amount of time to see whether it will be implemented. You could also roll up your sleeves and program a larger kernel yourself.

I am curious: What kind of use case requires a 64x64 kernel for Gaussian blur?

I was thinking of just downsampling the image, too… then a kernel size of 11 would be relatively larger.

I simply want a lot of blur, and 11x11 is limiting. I don’t think I can set sigma high enough to make up for the small kernel size. Or can a high sigma counteract a small kernel to effectively give a lot of blur?

It’s a hardcoded limit. I wouldn’t be able to explain the details of the internal library implementation. Some options:

  1. Use OpenCV. There may be a GPU-accelerated variant there as well.
  2. Use NPP (but it’s limited to a maximum of 15x15).
  3. Write your own CUDA kernel; there is Gaussian blur CUDA sample code (see the sketch after this list).
  4. File a feature request to state your interest in larger kernel sizes.
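
To make option 3 concrete, here is a minimal, hypothetical sketch of a hand-rolled blur kernel (a naive 2D convolution, one thread per output pixel; all names here are my own invention, and a production version would at least use separable passes and shared-memory tiling):

```
// Hypothetical sketch only: naive 2D Gaussian convolution, one thread per output pixel.
// The (2*radius+1)^2 weights are precomputed on the host and passed in device memory.
__global__ void gaussianBlurNaive(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  const float* __restrict__ weights,
                                  int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            // Clamp-to-edge border handling
            int sx = min(max(x + dx, 0), width - 1);
            int sy = min(max(y + dy, 0), height - 1);
            acc += weights[(dy + radius) * (2 * radius + 1) + (dx + radius)]
                 * in[sy * width + sx];
        }
    }
    out[y * width + x] = acc;
}
```

Launched with, say, a 32x32 block and a grid covering the image, this works for arbitrary radii; it is simply not tuned for any of them.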

I am not an image-processing expert. In my limited experience, increasing sigma for a Gaussian blur filter does increase blur, but I can’t say how the effect of a small kernel with high sigma visually compares to a larger kernel with lower sigma. For questions like these, it is often easiest to simply check whether the result with those settings is acceptable for the intended purpose or not. In other words, run some targeted experiments instead of speculating.
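
One cheap experiment along those lines (my own sketch, not from any NVIDIA sample): compute how much of the Gaussian’s mass actually lands inside the window for a given sigma and radius. For an 11x11 kernel (radius 5), pushing sigma much beyond about 1.7 starts discarding a noticeable fraction of the tails, so the filter drifts toward a plain box average rather than giving more blur:

```
#include <cmath>
#include <cstdio>

// Fraction of a 1D Gaussian's mass that falls within [-radius, +radius].
// For a separable 2D Gaussian over a square window, the captured fraction
// is approximately this value squared.
double capturedFraction(double sigma, int radius)
{
    return std::erf(radius / (sigma * std::sqrt(2.0)));
}

int main()
{
    const int radius = 5; // 11x11 kernel
    const double sigmas[] = {1.0, 1.7, 3.0, 5.0, 10.0};
    for (double sigma : sigmas) {
        double f = capturedFraction(sigma, radius);
        std::printf("sigma=%5.1f  1D captured=%.4f  2D captured~=%.4f\n",
                    sigma, f, f * f);
    }
    return 0;
}
```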

This is a near-perfect answer, but there must be some reason why only such small kernels are available? In general, do GPUs have an inherent limitation, or is it pure coincidence? I understand NVIDIA’s code is proprietary, but mathematically, could there be a justification for using smaller kernels? Just curious from a hardware perspective is all…

As I said, I am not an image-processing expert. But from what I understand, use cases requiring relatively small kernels (windows) are common in filtering tasks, while use cases requiring relatively large kernels are rare.

There is nothing that prevents the implementation of Gaussian blur with a kernel size of 64x64 (or any other upper limit you choose) on GPUs, but, by analogy with matrix multiplication, it seems unlikely that one particular implementation approach will yield high performance across all kernel sizes. The general matrix-matrix multiply (GEMM) is actually dozens of different code paths under the hood for that reason, selected based on matrix size, matrix aspect ratio, transpose modes, and GPU architecture.

So I suspect that the same applies to the size of these filter kernels, and that the library makers focused their initial efforts on the most common use cases, which strikes me as the most suitable approach when engineering resources are limited. The more customers let NVIDIA know that larger kernel sizes are desired (by filing a feature request), the higher the chance that such support will materialize within a reasonable time frame.


Thanks, my curiosity stems from wondering whether I’m abusing Gaussian blur, as in using it to an extent that could cause stability or performance problems, to the point where not even NVIDIA recognizes it as a valid use case.

But just as in, e.g., Photoshop, if I want to blur a very high-resolution image, I will need a significantly larger kernel. What was not intuitive to me is that the underlying code would have different implementations based on kernel size, so thanks for that. We’ll never know for certain unless NVIDIA open-sources the library, but it is good awareness just the same.

Generally speaking, and from historical observation, NVIDIA library development is responsive to customer feedback, so if additional functionality is desired, filing a feature request (one that includes where and why the desired feature is important) is the way to go. Obviously this does not guarantee the appearance of a particular feature at a particular time.

You can also search the internet for open-source alternatives. GitHub is often a good starting point.

Thanks, I started with OpenCV, but CUDA bindings were not available for its Gaussian blur, hence my entry into NVIDIA’s offering.

I did file a feature request… hopefully correctly!: https://developer.nvidia.com/nvidia_bug/3406148

I appreciate your feedback. NVIDIA is a good community.

Since it is open-source software, you could add any currently missing functionality to the code base yourself and contribute it back to the open-source community.

This question and answer on the Image Processing Stack Exchange may be helpful regarding kernel size versus sigma:

Gaussian Blur - Standard Deviation, Radius and Kernel Size
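
As a quick companion to that link (my own sketch, not taken from the linked answer): a common convention is to choose the kernel radius at roughly three standard deviations, which also shows why a sigma around 10 wants a kernel on the order of 61 taps per axis, i.e. in the ballpark of the 64x64 requested above:

```
#include <cmath>
#include <cstdio>

// Rule of thumb: radius ~= 3*sigma captures ~99.7% of the Gaussian per axis.
int kernelSizeForSigma(double sigma)
{
    int radius = static_cast<int>(std::ceil(3.0 * sigma));
    return 2 * radius + 1; // always odd
}

double maxSigmaForKernelSize(int ksize)
{
    return ((ksize - 1) / 2) / 3.0;
}

int main()
{
    std::printf("sigma=10 -> kernel of about %d taps per axis\n", kernelSizeForSigma(10.0));
    std::printf("11x11 kernel -> largest 'safe' sigma of about %.2f\n", maxSigmaForKernelSize(11));
    return 0;
}
```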


I would echo some of the things that njuffa said. The GPU is a fairly flexible processor, so nearly anything (certainly including your large filter kernel sizes) should be possible. If that were all I intended to do, I’m pretty sure I could whip up a CUDA kernel to do it in a reasonably efficient way in a few hours.

I don’t generally have access to the source code of these libraries. If I were to speculate, it would be that:

  1. Smaller filter sizes are easier to implement “optimally”, i.e. with confidence that the work is high performance.
  2. There is not enough perceived demand to implement the larger kernel cases, with the additional coding complexity that implies.

CUDA threadblocks are limited to 1024 threads that can collectively work together using the “best” GPU features available for this purpose. Furthermore, shared memory, which is also probably relevant for a “best” implementation, is a limited resource.

It’s fairly easy to write “optimal” code that arranges work in 32x32 tiles, one thread per element. Such a tile can probably get reasonable data reuse with filter sizes up to 11x11, and furthermore the shared memory usage is such that it probably doesn’t become a serious occupancy concern at these scales. A detailed understanding of these ideas requires some experience actually writing this kind of code yourself, but the basic pattern is fairly common and self-evident. You can find plenty of examples of 2D stencil CUDA kernels with a bit of searching; a minimal sketch follows below.
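
For illustration, here is a minimal sketch of that 32x32-tile pattern (my own code, not the VPI implementation); it also makes visible why the shared-memory footprint and the halo grow with the filter radius:

```
#define TILE   32
#define RADIUS 5   // 11x11 filter; shared tile is (TILE + 2*RADIUS)^2 floats

// Filter weights copied to constant memory by the host via cudaMemcpyToSymbol.
__constant__ float d_weights[(2 * RADIUS + 1) * (2 * RADIUS + 1)];

// Launch with dim3 block(TILE, TILE) and a grid covering the image.
__global__ void gaussianBlurTiled(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Cooperative load of the tile plus halo, clamping reads at the borders.
    for (int ty = threadIdx.y; ty < TILE + 2 * RADIUS; ty += TILE) {
        for (int tx = threadIdx.x; tx < TILE + 2 * RADIUS; tx += TILE) {
            int gx = (int)(blockIdx.x * TILE) + tx - RADIUS;
            int gy = (int)(blockIdx.y * TILE) + ty - RADIUS;
            tile[ty][tx] = in[min(max(gy, 0), height - 1) * width
                              + min(max(gx, 0), width - 1)];
        }
    }
    __syncthreads();

    int outX = blockIdx.x * TILE + threadIdx.x;
    int outY = blockIdx.y * TILE + threadIdx.y;
    if (outX >= width || outY >= height) return;

    // Each thread filters one pixel entirely out of shared memory.
    float acc = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            acc += d_weights[(dy + RADIUS) * (2 * RADIUS + 1) + (dx + RADIUS)]
                 * tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];

    out[outY * width + outX] = acc;
}
```

At a radius of 5 the shared tile is 42x42 floats (about 7 KB per block), which is harmless; at a radius of 32 it would be 96x96 floats (about 36 KB), which is where the occupancy concerns mentioned above start to bite.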

Therefore I suspect it may just be the 80/20 rule: 80% of the use cases can be solved with 20% of the worst-case code, and that may be what the library implementers decided to implement.
