Optimizing color channels in image processing (Gaussian blur)

Hi!

If you were unfortunate enough, you might have seen my thread last week, where I posted about the poor speed of a Gaussian blur function.

All in all, it was partly the code’s fault (apart from my old GPU). I managed to work on it and reduce the time from 1800 milliseconds to 650 milliseconds. The thing is that I need more speed: OpenCV runs a serial Gaussian filter in 360 milliseconds.

The Gaussian kernel I’m using now: https://paste.ofcode.org/rbT9nmGUDtYBTaFbmt9hZS

So instead of doing operations on 3 separate channels every time, I was thinking about using the uint3 structure: creating a matrix with 9 positions where every position holds 3 components (e.g. position 1 holds R1, G1, B1). I’m not really familiar with this, though, so I’d appreciate it if someone could give me a hand with this matter.

I hope this way I will reduce the amount of computation involved.
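For what it’s worth, a packed-channel kernel might look roughly like the sketch below. This is untested and makes assumptions beyond the thread: an 8-bit RGB image stored as a flat `uchar3` array (for 8-bit channels, `uchar3` is the natural choice rather than `uint3`), clamp-to-edge borders, and a 3×3 weight array in global memory; all names are illustrative.

```cuda
#include <cuda_runtime.h>

// One thread per output pixel; each in[] load fetches R, G and B together.
__global__ void blur3x3_packed(const uchar3* in, uchar3* out,
                               int width, int height,
                               const float* gauss /* 9 weights, row-major */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float3 acc = make_float3(0.f, 0.f, 0.f);
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            // clamp-to-edge border handling
            int sx = min(max(x + dx, 0), width  - 1);
            int sy = min(max(y + dy, 0), height - 1);
            uchar3 p = in[sy * width + sx];
            float w  = gauss[(dy + 1) * 3 + (dx + 1)];
            acc.x += w * p.x;
            acc.y += w * p.y;
            acc.z += w * p.z;
        }
    }
    out[y * width + x] = make_uchar3((unsigned char)(acc.x + 0.5f),
                                     (unsigned char)(acc.y + 0.5f),
                                     (unsigned char)(acc.z + 0.5f));
}
```

One caveat: `uchar3` is 3 bytes, so its loads are not naturally aligned; padding to `uchar4` (RGBA) gives aligned 4-byte loads and usually coalesces better, at the cost of 33% more memory traffic.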

I think that Gaussian blur should be limited by memory bandwidth rather than by the amount of computation. You can check this by profiling, or just by measuring the overall processing speed, which is probably close to the VRAM bandwidth.
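A quick way to test the memory-bound hypothesis is to compare the blur kernel’s effective bandwidth ((bytes read + bytes written) / time) against what a plain device-to-device copy achieves on the same card. A host-side sketch (function name is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the effective bandwidth of a device-to-device memcpy in GB/s,
// counting both the bytes read and the bytes written.
float memcpy_bandwidth_gbs(void* dst, const void* src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (2.0f * bytes / 1e9f) / (ms / 1e3f);
}
```

If the blur kernel moves the image in and out at a small fraction of this figure, it is compute- or latency-bound and there is still headroom to optimize; if it is close, faster math won’t help much.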

Then could you help me try to implement this algorithm with shared memory? I had a lot of things to take care of before starting this project and I kind of dropped the ball in the sense of not being able to study extensively for it.

I tried to follow this: c - When is CUDA's __shared__ memory useful? - Stack Overflow

And this: Using Shared Memory in CUDA C/C++ | NVIDIA Technical Blog

And my result was this: https://paste.ofcode.org/zhQ775hni4UCupKe5CtFhA

But it seems I get no output and no speed-up. So I’m sure I’m doing something wrong.

I’d appreciate any help/advice.

For Kepler architectures or later, the principles mentioned in paper [1] work nicely in my experience for small convolution kernels.

  • load through the texture path (if you use texture objects instead of the __ldg() instruction, you can also skip border handling)
  • process multiple elements per thread (usually a small ‘tile’ of 2–4 elements is enough)
  • segment the CUDA kernel into 3 main phases (‘register blocking’): load the source-image tile into a register array - process the tile (convolve) - save the tile result (register array) to global memory

Note that no shared memory is used (which might bring additional performance advantages on Volta). You can bind a uint3 (interleaved) image to a texture object with some tricks (note that you then have to implement border handling yourself). Calculations should be done using the ‘float’ datatype instead of ‘int’.

For bigger convolution kernels one can split the convolution kernel into e.g. two parts and apply the same strategy twice.
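The three phases above can be sketched roughly as follows. This is a simplified single-channel float version, not the paper’s code: it assumes a texture object with clamp addressing (so the hardware handles borders), a 3×3 kernel, and 4 output rows per thread; names and the TILE size are illustrative.

```cuda
#include <cuda_runtime.h>

#define TILE 4  // output rows processed per thread

__global__ void conv3x3_regblock(cudaTextureObject_t src, float* dst,
                                 int width, int height,
                                 const float* __restrict__ gauss)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;
    int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    if (x >= width) return;

    // Phase 1: load the tile plus a one-row halo into a register array.
    // Constant loop bounds let the compiler unroll and keep r[][] in registers.
    float r[TILE + 2][3];
    for (int i = 0; i < TILE + 2; ++i)
        for (int j = 0; j < 3; ++j)
            r[i][j] = tex2D<float>(src, x + j - 1 + 0.5f, y0 + i - 1 + 0.5f);

    // Phase 2: convolve each tile element entirely in registers.
    float out[TILE];
    for (int i = 0; i < TILE; ++i) {
        float acc = 0.f;
        for (int dy = 0; dy < 3; ++dy)
            for (int dx = 0; dx < 3; ++dx)
                acc += gauss[dy * 3 + dx] * r[i + dy][dx];
        out[i] = acc;
    }

    // Phase 3: store the tile result to global memory.
    for (int i = 0; i < TILE; ++i)
        if (y0 + i < height)
            dst[(y0 + i) * width + x] = out[i];
}
```

The point of the tile is data reuse: each loaded row contributes to up to three output rows, so per-thread loads drop from 9 per output to (TILE+2)·3 / TILE ≈ 4.5 for TILE = 4, with the texture cache absorbing the horizontal overlap between neighboring threads.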

[1] http://www.forrestiandola.com/PREPRINT_convolution_2d_gpu_registers.pdf
code: GitHub - forresti/convolution: Communication-Minimizing 2D Convolution in GPU Registers

Thank you for the post. Before reading it I tried yet another way to do the convolution.

Sadly there are quite a few errors there; could you please take a look?

https://paste.ofcode.org/cBcUkhpxHFHGDAufRPg3dB

Errors are:

Error MSB3721 The command ““C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe” -gencode=arch=compute_50,code="sm_50,compute_50" --use-local-env --cl-version 2015 -ccbin “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler “/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o x64\Release\kernel.cu.obj “D:\Licenta\CUDATest\CUDATest\kernel.cu”” exited with code 2. CUDATest C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\BuildCustomizations\CUDA 8.0.targets 689

Error no instance of overloaded function “GaussianBlur” matches the argument list CUDATest D:\Licenta\CUDATest\CUDATest\kernel.cu 305

Error (active) expected an expression CUDATest d:\Licenta\CUDATest\CUDATest\kernel.cu 305

Edit:

It was a typo issue. Now I get no output and it’s slower than before.

Even later edit:

It’s working now. But it’s slower. I tried to implement shared memory and got this:

cudaDeviceSynchronize returned error code 77

Which is related to this section:

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching GaussianBlur!\n", cudaStatus);
	goto Error;
}

This is the version of code with the “new” and “clean” convolution kernel with errors related to shared memory: https://paste.ofcode.org/QGpvRRUT6gYJdBD5hu8ysT

Sadly, this “clean” and “new” version is A LOT slower than my “messy” first version. The new one scores around 2500-3000ms. The old one scores around 650ms.

Here I was trying to implement shared memory on the old one: https://paste.ofcode.org/TnsKqEixDD4htjfHG2Qa2z

On the old one, the same error-code-77 issue is present. Can someone help me fix this?
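For reference, error 77 is cudaErrorIllegalAddress: the kernel read or wrote outside a valid allocation. With shared-memory convolutions this is classically caused by sizing the tile without the halo (apron), or by not clamping the global coordinates when border blocks load the halo. A bounds-checked sketch, with assumptions beyond the thread (single-channel float image, 3×3 kernel, 16×16 blocks; names are illustrative):

```cuda
#include <cuda_runtime.h>

#define BLOCK  16
#define RADIUS 1  // 3x3 kernel -> 1-pixel halo

__global__ void blur3x3_shared(const float* in, float* out,
                               int width, int height,
                               const float* __restrict__ gauss)
{
    // Tile sized BLOCK + 2*RADIUS in each dimension to hold the halo.
    __shared__ float tile[BLOCK + 2 * RADIUS][BLOCK + 2 * RADIUS];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // Cooperative load: each thread fills one or more tile cells, clamping
    // the global coordinates so border blocks never read out of bounds.
    for (int ty = threadIdx.y; ty < BLOCK + 2 * RADIUS; ty += BLOCK)
        for (int tx = threadIdx.x; tx < BLOCK + 2 * RADIUS; tx += BLOCK) {
            int gx = min(max((int)(blockIdx.x * BLOCK) + tx - RADIUS, 0), width  - 1);
            int gy = min(max((int)(blockIdx.y * BLOCK) + ty - RADIUS, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    // Early-out AFTER the barrier: returning before __syncthreads() when
    // some threads of the block still reach it is itself a bug.
    if (x >= width || y >= height) return;

    float acc = 0.f;
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            acc += gauss[dy * 3 + dx] *
                   tile[threadIdx.y + dy][threadIdx.x + dx];
    out[y * width + x] = acc;
}
```

Running with `cuda-memcheck` will report the exact line and address of the illegal access, which is usually faster than guessing from the error code alone.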