Could a GPU significantly speed up this image processing task?

Hi, I’m not experienced in GPU programming and I’m wondering if it makes sense for me to get into it (and buy hardware).

I work on a resource-intensive image processing code. I regularly need to perform this convolution a couple of thousand times for a large (~40 million double entries) matrix M:

M ⊙ [convolution kernel]

M is the same for each computation, while the kernel changes every time. I recently introduced multi-threading so that each CPU core works through the above sequence on its own, but I’m wondering whether a GPU would be faster.

Right now, the forward and backward Fourier transforms take up the most time by far at just under 4 seconds each, using OpenCV’s dft function.
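
For anyone following along, here is a minimal sketch of the pattern such a convolution usually follows, assuming the sequence above is the usual forward DFT, elementwise product, inverse DFT. The names naive_dft, convolve_via_dft and convolve_direct are made up for illustration, and the O(n^2) transform only stands in for a real FFT such as OpenCV’s dft or cuFFT; it is practical for tiny inputs only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT, for illustration only (hypothetical helper).
// sign = -1 is the forward transform, sign = +1 the unscaled inverse.
std::vector<cd> naive_dft(const std::vector<cd>& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce j*k modulo n so the angle stays small and accurate.
            const double angle =
                sign * 2.0 * pi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution: forward DFT of both inputs, elementwise product,
// inverse DFT, then division by n to undo the unscaled inverse.
std::vector<double> convolve_via_dft(const std::vector<double>& m,
                                     const std::vector<double>& kernel) {
    assert(m.size() == kernel.size());
    const std::size_t n = m.size();
    const std::vector<cd> fm = naive_dft(std::vector<cd>(m.begin(), m.end()), -1);
    const std::vector<cd> fk = naive_dft(std::vector<cd>(kernel.begin(), kernel.end()), -1);
    std::vector<cd> product(n);
    for (std::size_t i = 0; i < n; ++i) {
        product[i] = fm[i] * fk[i];  // every element is independent of the others
    }
    const std::vector<cd> back = naive_dft(product, +1);
    std::vector<double> result(n);
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = back[i].real() / static_cast<double>(n);
    }
    return result;
}

// Direct O(n^2) circular convolution, used as a reference result.
std::vector<double> convolve_direct(const std::vector<double>& m,
                                    const std::vector<double>& kernel) {
    assert(m.size() == kernel.size());
    const std::size_t n = m.size();
    std::vector<double> out(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            out[i] += m[j] * kernel[(i + n - j) % n];
        }
    }
    return out;
}
```

With m = {1, 2, 3, 4} and the kernel {1, 0, 0, 0}, convolve_via_dft returns m again up to rounding error, and for any other kernel of the same length it should agree with convolve_direct. The two transforms dominate the cost, which is why they are the natural part to hand to a GPU library first.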

Couldn’t you simply use this as a drop-in and enjoy a speed boost with CUDA-capable hardware?


Just to complement what cbuchner1 said, and without knowing anything about OpenCV or your hardware: I have cuFFT running on a 1080Ti, and the high-resolution timer only starts registering something when my input data is bigger than 200M floats (about 760MB).
And just for a fair comparison, Intel’s MKL FFT running on my CPU (4-core i5 6400) is faster on a bigger dataset (subsecond on 512MB).
You can get a 1080Ti for manageable prices now.

Thanks! I spent the last few days making Christian’s suggestion work, and today I succeeded, after installing (the right) CUDA, recompiling OpenCV, dealing with the new data types, etc.

I’m very satisfied with the DFT speedup. Provided that clock_gettime(CLOCK_MONOTONIC) gives meaningful results here, each DFT takes something like 0.6s on average, compared to almost 5s on the CPU.
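
On whether the clock gives meaningful results: a minimal sketch of how such a measurement could be set up, assuming a POSIX system. The helpers monotonic_seconds and average_seconds are hypothetical names, not part of OpenCV or CUDA. One caveat is that if the GPU call returns before the device has finished, the timed callable has to include the wait (for example cudaDeviceSynchronize, or waiting on the OpenCV stream), otherwise only the launch is measured.

```cpp
#include <cassert>
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Current reading of the monotonic clock in seconds.
double monotonic_seconds() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<double>(ts.tv_sec) + static_cast<double>(ts.tv_nsec) * 1e-9;
}

// Average wall-clock seconds per call of `work` over `repeats` runs.
// One untimed warm-up call comes first, since the first GPU call often
// carries one-time setup cost that would distort the average.
template <typename Work>
double average_seconds(Work work, int repeats) {
    assert(repeats > 0);
    work();  // warm-up, not timed
    const double start = monotonic_seconds();
    for (int i = 0; i < repeats; ++i) {
        work();  // must block until the result is actually ready
    }
    return (monotonic_seconds() - start) / repeats;
}
```

Passing a lambda that runs one DFT and then waits for completion, with something like ten repeats, should give a steadier figure than timing a single call.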

But the DFT is only one task of the above algorithm, and since the others are still running on the CPU, they still take up a good amount of time. I tried running the new algorithm in multiple threads (and CPU cores), but even two threads provoke a GPU out of memory error. My machine has a Q5000 graphics card with apparently 2.5GB memory. My question now is, if I invested in maybe a 1080Ti with more memory, would it be trivial to run multiple threads on it that each perform DFTs?
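
A rough back-of-the-envelope check of the memory question, as a sketch under stated assumptions: each thread is assumed to hold three complex buffers of the full matrix size on the device (input, spectrum, output), and the FFT library’s own scratch space is ignored. Both are guesses rather than measured values, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for one complex-valued buffer with `elements` entries, where each
// entry is two scalars (real and imaginary) of `scalar_bytes` bytes.
std::uint64_t complex_buffer_bytes(std::uint64_t elements, std::uint64_t scalar_bytes) {
    return elements * 2 * scalar_bytes;
}

// Rough device-memory need when `threads` host threads each keep
// `buffers_per_thread` complex buffers alive at the same time.
// Assumption: FFT scratch space and other allocations are NOT included.
std::uint64_t estimated_device_bytes(std::uint64_t elements,
                                     std::uint64_t scalar_bytes,
                                     std::uint64_t buffers_per_thread,
                                     std::uint64_t threads) {
    assert(threads > 0 && buffers_per_thread > 0);
    return complex_buffer_bytes(elements, scalar_bytes) * buffers_per_thread * threads;
}

// True if the estimate fits into a device with `device_bytes` of memory.
bool fits_on_device(std::uint64_t needed_bytes, std::uint64_t device_bytes) {
    return needed_bytes <= device_bytes;
}
```

For 40 million entries in single precision, one complex buffer is 320,000,000 bytes, so two threads with three buffers each come to 1,920,000,000 bytes. That is already close to a 2.5GB card before any scratch space is counted, and in double precision the same setup needs 3,840,000,000 bytes. Under these assumptions the out-of-memory error with two threads is plausible, and the 11GB on a 1080Ti would leave room for several threads.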

I’m not sure it is the number of threads causing this out of memory error… without knowing OpenCV it is difficult for me to provide any plausible guess.

But talking specifically about the 1080Ti, it will be a huge leap over the Q5000 (Kepler, I believe?).
With cuFFT you don’t specify the launch configuration as it is being done “under the hood”, so it probably finds optimal settings for the current device. You may want to read this post, where one of my questions is exactly this, and txbob provides ample information:

The DFT is (now) of least concern to me, as the API takes care of this; you just have to provide the correct arguments. As for the rest of the computation between the forward and inverse DFTs, you would have to rewrite it using the CUDA API to benefit from such an upgrade.

Maybe cbuchner1 or someone else wants to step in and give feedback, as there will definitely be a time cost and a learning curve in migrating to a different parallel architecture. To get the most out of these devices (and your money), one needs to use their native API.

Maybe the question is too specific to be answered here. Agreed, the cleanest approach would be to implement each operation properly with CUDA. I’ll see how far I can get with OpenCV’s built-in functions and consider learning how to apply CUDA directly. I figure I can always order a 1080, try multithreading, and return it if it doesn’t work…