Optimizing color channels in image processing (Gaussian blur)

Hi!

If you were unfortunate enough, you might have seen my thread last week, where I posted about the poor speed of a Gaussian blur function.

All in all, it was partly the code’s fault (apart from my old GPU). I managed to work on it and reduce the time from 1800 milliseconds to 650 milliseconds. The thing is that I need more speed: OpenCV runs a serial Gaussian filter in 360 milliseconds.

The Gaussian kernel I’m using now: https://paste.ofcode.org/rbT9nmGUDtYBTaFbmt9hZS

So instead of doing operations on 3 separate channels every time, I was thinking about using the uint3 structure: creating a matrix with 9 positions where every position holds 3 components (e.g. position 1 holds R1, G1, B1). I’m not really familiar with this, though, so I’d appreciate it if someone could give me a hand with this matter.

I hope this way I will reduce the amount of computation involved.
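For what it’s worth, a packed-channel kernel might look roughly like the sketch below. This is untested and makes assumptions beyond the thread: an 8-bit RGB image stored as a flat `uchar3` array (for 8-bit channels, `uchar3` is the natural choice rather than `uint3`), clamp-to-edge borders, and a 3×3 weight array in global memory; all names are illustrative.

```cuda
#include <cuda_runtime.h>

// One thread per output pixel; each in[] load fetches R, G and B together.
__global__ void blur3x3_packed(const uchar3* in, uchar3* out,
                               int width, int height,
                               const float* gauss /* 9 weights, row-major */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float3 acc = make_float3(0.f, 0.f, 0.f);
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            // clamp-to-edge border handling
            int sx = min(max(x + dx, 0), width  - 1);
            int sy = min(max(y + dy, 0), height - 1);
            uchar3 p = in[sy * width + sx];
            float w  = gauss[(dy + 1) * 3 + (dx + 1)];
            acc.x += w * p.x;
            acc.y += w * p.y;
            acc.z += w * p.z;
        }
    }
    out[y * width + x] = make_uchar3((unsigned char)(acc.x + 0.5f),
                                     (unsigned char)(acc.y + 0.5f),
                                     (unsigned char)(acc.z + 0.5f));
}
```

One caveat: `uchar3` is 3 bytes, so its loads are not naturally aligned; padding to `uchar4` (RGBA) gives aligned 4-byte loads and usually coalesces better, at the cost of 33% more memory traffic.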

I think that Gaussian blur should be limited by memory bandwidth rather than by the amount of computation. You can check this by profiling, or just by measuring the overall processing speed, which is probably close to the VRAM bandwidth.
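A quick way to test the memory-bound hypothesis is to compare the blur kernel’s effective bandwidth ((bytes read + bytes written) / time) against what a plain device-to-device copy achieves on the same card. A host-side sketch (function name is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the effective bandwidth of a device-to-device memcpy in GB/s,
// counting both the bytes read and the bytes written.
float memcpy_bandwidth_gbs(void* dst, const void* src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (2.0f * bytes / 1e9f) / (ms / 1e3f);
}
```

If the blur kernel moves the image in and out at a small fraction of this figure, it is compute- or latency-bound and there is still headroom to optimize; if it is close, faster math won’t help much.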

Then could you help me try to implement this algorithm with shared memory? I had a lot of things to take care of before starting this project and I kind of dropped the ball in the sense of not being able to study extensively for it.

I tried to follow this: c - When is CUDA's __shared__ memory useful? - Stack Overflow

And this: Using Shared Memory in CUDA C/C++ | NVIDIA Technical Blog

And my result was this: https://paste.ofcode.org/zhQ775hni4UCupKe5CtFhA

But it seems I get no output and no speed-up. So I’m sure I’m doing something wrong.

I’d appreciate any help/advice.

For Kepler architectures or later, the principles mentioned in paper [1] work nicely in my experience for small convolution kernels.

  • load through the texture path (if you use texture objects instead of the __ldg() instruction, you can also skip border handling)
  • process multiple elements per thread (usually a small ‘tile’ of 2–4 elements is enough)
  • segment the CUDA kernel into 3 main phases (‘register blocking’): load the source-image tile into a register array - process the tile (convolve) - save the tile result (register array) to global memory

Note that no shared memory is used (which might bring additional performance advantages on Volta). You can bind a uint3 (interleaved) image to a texture object with some tricks (note that you then have to implement border handling yourself). Calculations should be done using the ‘float’ datatype instead of ‘int’.

For bigger convolution kernels one can split the convolution kernel into e.g. two parts and apply the same strategy twice.
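The three phases above can be sketched roughly as follows. This is a simplified single-channel float version, not the paper’s code: it assumes a texture object with clamp addressing (so the hardware handles borders), a 3×3 kernel, and 4 output rows per thread; names and the TILE size are illustrative.

```cuda
#include <cuda_runtime.h>

#define TILE 4  // output rows processed per thread

__global__ void conv3x3_regblock(cudaTextureObject_t src, float* dst,
                                 int width, int height,
                                 const float* __restrict__ gauss)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;
    int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    if (x >= width) return;

    // Phase 1: load the tile plus a one-row halo into a register array.
    // Constant loop bounds let the compiler unroll and keep r[][] in registers.
    float r[TILE + 2][3];
    for (int i = 0; i < TILE + 2; ++i)
        for (int j = 0; j < 3; ++j)
            r[i][j] = tex2D<float>(src, x + j - 1 + 0.5f, y0 + i - 1 + 0.5f);

    // Phase 2: convolve each tile element entirely in registers.
    float out[TILE];
    for (int i = 0; i < TILE; ++i) {
        float acc = 0.f;
        for (int dy = 0; dy < 3; ++dy)
            for (int dx = 0; dx < 3; ++dx)
                acc += gauss[dy * 3 + dx] * r[i + dy][dx];
        out[i] = acc;
    }

    // Phase 3: store the tile result to global memory.
    for (int i = 0; i < TILE; ++i)
        if (y0 + i < height)
            dst[(y0 + i) * width + x] = out[i];
}
```

The point of the tile is data reuse: each loaded row contributes to up to three output rows, so per-thread loads drop from 9 per output to (TILE+2)·3 / TILE ≈ 4.5 for TILE = 4, with the texture cache absorbing the horizontal overlap between neighboring threads.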

[1] http://www.forrestiandola.com/PREPRINT_convolution_2d_gpu_registers.pdf
code: GitHub - forresti/convolution: Communication-Minimizing 2D Convolution in GPU Registers

Thank you for the post. Before reading it I tried yet another way to do the convolution.

Sadly there are quite a few errors there; could you please take a look?

https://paste.ofcode.org/cBcUkhpxHFHGDAufRPg3dB

Errors are:

Error MSB3721 The command ““C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe” -gencode=arch=compute_50,code="sm_50,compute_50" --use-local-env --cl-version 2015 -ccbin “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler “/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o x64\Release\kernel.cu.obj “D:\Licenta\CUDATest\CUDATest\kernel.cu”” exited with code 2. CUDATest C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\BuildCustomizations\CUDA 8.0.targets 689

Error no instance of overloaded function “GaussianBlur” matches the argument list CUDATest D:\Licenta\CUDATest\CUDATest\kernel.cu 305

Error (active) expected an expression CUDATest d:\Licenta\CUDATest\CUDATest\kernel.cu 305

Edit:

It was a typo issue. Now I get no output and it’s slower than before.

Even later edit:

It’s working now. But it’s slower. I tried to implement shared memory and got this:

cudaDeviceSynchronize returned error code 77

Which is related to this section:

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching GaussianBlur!\n", cudaStatus);
	goto Error;
}

This is the version of code with the “new” and “clean” convolution kernel with errors related to shared memory: https://paste.ofcode.org/QGpvRRUT6gYJdBD5hu8ysT

Sadly, this “clean” and “new” version is A LOT slower than my “messy” first version. The new one scores around 2500-3000ms. The old one scores around 650ms.

Here I was trying to implement shared memory on the old one: https://paste.ofcode.org/TnsKqEixDD4htjfHG2Qa2z

On the old one, the same error-code-77 issue is present. Can someone help me fix this?
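For reference, error 77 is cudaErrorIllegalAddress: the kernel read or wrote outside a valid allocation. With shared-memory convolutions this is classically caused by sizing the tile without the halo (apron), or by not clamping the global coordinates when border blocks load the halo. A bounds-checked sketch, with assumptions beyond the thread (single-channel float image, 3×3 kernel, 16×16 blocks; names are illustrative):

```cuda
#include <cuda_runtime.h>

#define BLOCK  16
#define RADIUS 1  // 3x3 kernel -> 1-pixel halo

__global__ void blur3x3_shared(const float* in, float* out,
                               int width, int height,
                               const float* __restrict__ gauss)
{
    // Tile sized BLOCK + 2*RADIUS in each dimension to hold the halo.
    __shared__ float tile[BLOCK + 2 * RADIUS][BLOCK + 2 * RADIUS];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // Cooperative load: each thread fills one or more tile cells, clamping
    // the global coordinates so border blocks never read out of bounds.
    for (int ty = threadIdx.y; ty < BLOCK + 2 * RADIUS; ty += BLOCK)
        for (int tx = threadIdx.x; tx < BLOCK + 2 * RADIUS; tx += BLOCK) {
            int gx = min(max((int)(blockIdx.x * BLOCK) + tx - RADIUS, 0), width  - 1);
            int gy = min(max((int)(blockIdx.y * BLOCK) + ty - RADIUS, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    // Early-out AFTER the barrier: returning before __syncthreads() when
    // some threads of the block still reach it is itself a bug.
    if (x >= width || y >= height) return;

    float acc = 0.f;
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            acc += gauss[dy * 3 + dx] *
                   tile[threadIdx.y + dy][threadIdx.x + dx];
    out[y * width + x] = acc;
}
```

Running with `cuda-memcheck` will report the exact line and address of the illegal access, which is usually faster than guessing from the error code alone.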