Could a GPU significantly speed up this image processing task?

Hi, I’m not experienced in GPU programming and I’m wondering if it makes sense for me to get into it (and buy hardware).

I work on a resource-intensive image processing code. I regularly need to perform the following convolution a couple of thousand times on a large matrix M (~40 million double-precision entries):

DFT(M)
fftshift(M)
M ⊙ [convolution kernel]
ifftshift(M)
iDFT(M)

M is the same for each computation, while the kernel always changes. I recently introduced multi-threading so that each CPU core runs the above sequence on its own kernel in parallel, but I’m wondering whether a GPU could make this faster.

Right now, the forward and backward Fourier transforms take up by far the most time, at just under 4 seconds each, using OpenCV’s dft function.
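In code, one pass looks roughly like this (a simplified sketch rather than my exact code; fftShift stands in for fftshift/ifftshift via the usual quadrant swap, since OpenCV has no built-in fftshift, and kernelSpectrum is just a placeholder name for the frequency-domain kernel):

#include <opencv2/core.hpp>

// Swap quadrants so the zero frequency sits in the centre
// (the usual OpenCV stand-in for fftshift; for even sizes the
// same swap also undoes itself, i.e. works as ifftshift).
static void fftShift(cv::Mat& spec)
{
    int cx = spec.cols / 2, cy = spec.rows / 2;
    cv::Mat q0(spec, cv::Rect(0, 0, cx, cy));   // top-left
    cv::Mat q1(spec, cv::Rect(cx, 0, cx, cy));  // top-right
    cv::Mat q2(spec, cv::Rect(0, cy, cx, cy));  // bottom-left
    cv::Mat q3(spec, cv::Rect(cx, cy, cx, cy)); // bottom-right
    cv::Mat tmp;
    q0.copyTo(tmp); q3.copyTo(q0); tmp.copyTo(q3);
    q1.copyTo(tmp); q2.copyTo(q1); tmp.copyTo(q2);
}

cv::Mat convolveOnce(const cv::Mat& M, const cv::Mat& kernelSpectrum)
{
    cv::Mat spec;
    cv::dft(M, spec, cv::DFT_COMPLEX_OUTPUT);        // DFT(M)
    fftShift(spec);                                  // fftshift
    cv::mulSpectrums(spec, kernelSpectrum, spec, 0); // pointwise multiply
    fftShift(spec);                                  // ifftshift
    cv::Mat result;
    cv::idft(spec, result, cv::DFT_SCALE | cv::DFT_REAL_OUTPUT); // iDFT
    return result;
}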

Couldn’t you simply use this as a drop-in replacement and enjoy a speed boost on CUDA-capable hardware?

https://docs.opencv.org/3.4/dc/de5/classcv_1_1cuda_1_1DFT.html
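Untested sketch of what that could look like (flags and types would need checking against your data; as far as I know the CUDA DFT works in 32-bit floats, so the doubles would have to be converted):

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

// Single-precision, complex-to-complex to keep the sketch simple.
void gpuDftOnce(const cv::Mat& M)               // M assumed to be CV_32FC2 here
{
    cv::cuda::GpuMat d_src, d_spec, d_out;
    d_src.upload(M);                            // host -> device copy

    // Forward DFT on the GPU; cv::cuda::createDFT() gives a reusable
    // plan if you run this many times.
    cv::cuda::dft(d_src, d_spec, M.size(), 0);

    // ... pointwise multiplication with the kernel spectrum would go
    //     here, e.g. via cv::cuda::mulSpectrums ...

    // Inverse DFT.
    cv::cuda::dft(d_spec, d_out, M.size(), cv::DFT_INVERSE | cv::DFT_SCALE);

    cv::Mat result;
    d_out.download(result);                     // device -> host copy
}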

Christian

Just to complement what cbuchner1 said, and without knowing anything about OpenCV or your hardware: I have cuFFT running on a 1080Ti, and the high-resolution timer only starts registering anything once my input data is bigger than 200M floats (about 760 MB).
And just for a fair comparison, Intel’s MKL FFT running on my CPU (a 4-core i5-6400) is faster on an even bigger dataset (subsecond on 512 MB).
You can get a 1080Ti at a manageable price now.
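For reference, the cuFFT side of such a test is only a few lines; a bare-bones sketch (sizes made up, error checking omitted):

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int nx = 8192, ny = 8192;                 // arbitrary example size

    cufftComplex* d_data = nullptr;
    cudaMalloc(&d_data, sizeof(cufftComplex) * nx * ny);
    // ... fill d_data with the input (e.g. cudaMemcpy from the host) ...

    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);          // plan creation is the expensive part

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward FFT
    cudaDeviceSynchronize();                        // wait for the GPU before timing/reading

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}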

Thanks! I spent the last few days making Christian’s suggestion work, and today I succeeded, after installing (the right) CUDA, recompiling OpenCV, dealing with the new data types, etc.

I’m very satisfied with the DFT speedup. Provided that clock_gettime(CLOCK_MONOTONIC) gives meaningful results here, each DFT takes something like 0.6s on average, compared to almost 5s on the CPU.
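For what it’s worth, I time it roughly like this (simplified; the waitForCompletion() call is there so that any asynchronous GPU work is actually included in the measurement):

#include <cstdio>
#include <time.h>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

static double seconds(const timespec& a, const timespec& b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

void timeOneDft(const cv::cuda::GpuMat& d_src, cv::cuda::GpuMat& d_dst)
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    cv::cuda::dft(d_src, d_dst, d_src.size(), 0);   // forward DFT on the GPU

    cv::cuda::Stream::Null().waitForCompletion();   // make sure the GPU is done
    clock_gettime(CLOCK_MONOTONIC, &t1);
    std::printf("DFT: %.3f s\n", seconds(t0, t1));
}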

But the DFT is only one step of the above algorithm, and since the other steps still run on the CPU, they still take up a good amount of time. I tried running the new algorithm in multiple threads (on multiple CPU cores), but even two threads trigger a GPU out-of-memory error. My machine has a Q5000 graphics card with apparently 2.5 GB of memory. My question now is: if I invested in, say, a 1080Ti with more memory, would it be trivial to run multiple threads that each perform DFTs on it?

I’m not sure it is the number of threads that is causing this out-of-memory error… without knowing OpenCV it is difficult for me to offer a plausible guess.
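One way to narrow it down would be to print the free device memory before and after each allocation/plan in each thread, e.g. with the plain runtime API (sketch, not OpenCV-specific):

#include <cstdio>
#include <cuda_runtime.h>

// Print how much device memory is currently free vs. total.
// Calling this around each DFT setup in each thread should show whether
// the 2.5 GB card is simply running out of room.
void printGpuMemory(const char* tag)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::printf("[%s] GPU memory: %.1f MB free of %.1f MB\n",
                tag, freeBytes / (1024.0 * 1024.0), totalBytes / (1024.0 * 1024.0));
}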

But talking specifically about the 1080Ti, it will be a huge leap over the Q5000 (a Fermi-generation card, if I’m not mistaken).
With cuFFT you don’t specify the launch configuration, as that is handled “under the hood”, so it probably finds optimal settings for the current device. You may want to read this post, where one of my questions was exactly this, and txbob provides ample information:

[url]https://devtalk.nvidia.com/default/topic/1037667/internal-details-limitations-of-cufft-general-questions/?offset=1#5271797[/url]

The DFT is (now) of least concern to me, as the API takes care of it; you just have to provide the correct arguments. As for the rest of the computation between the forward and inverse DFTs, you would have to rewrite it using the CUDA API to benefit from such an upgrade.
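Just to give an idea of what that rewrite could look like, the pointwise multiplication between the two DFTs might be a small custom kernel along these lines (illustrative only; OpenCV’s cv::cuda::mulSpectrums may already cover this step):

#include <cuComplex.h>

// Element-wise complex multiplication of the spectrum with the
// convolution kernel - the step between the forward and inverse DFTs.
__global__ void multiplySpectra(cuFloatComplex* spectrum,
                                const cuFloatComplex* kernel,
                                int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        spectrum[i] = cuCmulf(spectrum[i], kernel[i]);
}

// Typical launch: one thread per element.
// multiplySpectra<<<(n + 255) / 256, 256>>>(d_spectrum, d_kernel, n);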

Maybe cbuchner1 or someone else wants to step in and give feedback, as there will definitely be a time cost and a learning curve in migrating to a different parallel architecture. To get the most out of these devices (and your money), one needs to use their native API.

Maybe the question is too specific to be answered here. Agreed, the cleanest solution would be to implement each operation properly with CUDA. I’ll see how far I can get with OpenCV’s built-in functions, consider learning how to apply CUDA directly, and I figure I can always order a 1080, try multithreading, and return it if it doesn’t work…