Why is FarnebackOpticalFlow (GPU) slower than calcOpticalFlowFarneback (CPU)?

Dear All,

I tried to use FarnebackOpticalFlow (the GPU version) to replace calcOpticalFlowFarneback (the CPU version), but the results show that the optical flow runs slower on the GPU than on the CPU.

I have also tried calling the FarnebackOpticalFlow GPU function a second time; it still takes about 110 ms.

This result is unexpected. Is there anything I missed?

thanks in advance!

My environment:
Intel i7 / NVIDIA GeForce GT 705 (48 cores) / 16 GB memory
OpenCV 2.3.14

The FarnebackOpticalFlow GPU version takes about 120 ms;
the calcOpticalFlowFarneback CPU version takes about 80 ms.

The following is the code snippet:

void GetOptFlow_gpu(Mat framePre, Mat frameCur, Mat& flow, Mat& cflow)
{
    GPU_PERFORMANCE_TRACE_BEGIN_GET_OPT_FLOW()

    //calcOpticalFlowFarneback(framePre, frameCur, flow, 0.5, 3, 15, 3, 5, 1.2, 0);
    Mat frame0 = framePre;
    Mat frame1 = frameCur;

    GpuMat d_frame0(frame0);
    GpuMat d_frame1(frame1);

    GpuMat d_flowx(frame0.size(), CV_32FC1);
    GpuMat d_flowy(frame0.size(), CV_32FC1);

    Mat flowx, flowy;
    FarnebackOpticalFlow farn;
#if 1
    farn.pyrScale  = 0.5;
    farn.winSize   = 15;
    farn.numIters  = 3;
    farn.polyN     = 5;
    farn.polySigma = 1.2;
    farn.flags     = 0;
#else
    farn.pyrScale     = 0.5;
    farn.fastPyramids = false;
    farn.winSize      = 13;
    farn.numIters     = 10;
    farn.polyN        = 5;
    farn.polySigma    = 1.1;
    farn.flags        = 0;
#endif
    {
        const int64 start = getTickCount();
        farn(d_frame0, d_frame1, d_flowx, d_flowy);
        const double timeSec = (getTickCount() - start) / getTickFrequency();
        cout << "Farn : " << timeSec << " sec" << endl;
    }

    {
        const int64 start = getTickCount();
        farn(d_frame0, d_frame1, d_flowx, d_flowy);
        const double timeSec = (getTickCount() - start) / getTickFrequency();
        cout << "Farn : " << timeSec << " sec" << endl;
    }

    d_flowx.download(flowx);
    d_flowy.download(flowy);

    vector<Mat> vecMat;
    vecMat.push_back(flowx);
    vecMat.push_back(flowy);
    merge(vecMat, flow);

    GPU_PERFORMANCE_TRACE_END_GET_OPT_FLOW()

    cvtColor(framePre, cflow, CV_GRAY2BGR);

    DrawOptFlowMap(flow, cflow, 8, 1.5, CV_RGB(0, 255, 0));
}

In addition, the video frame resolution is 640x480.

You don’t specify exactly what kind of Intel i7 CPU you have in your system, but the GeForce GT 705 is a very low-end GPU judging by its specification (http://www.geforce.com/hardware/desktop-gpus/geforce-gt-705-oem/specifications). Its memory throughput, for example, is, at 14.4 GB/sec, lower than the throughput of ordinary system memory (typically 25 GB/sec).

So I think it is very likely that your host system is simply faster than your GPU, which explains the poor performance of the GPU-accelerated code.

thank you, njuffa,

The PC platform is based on an Intel i7-4790 CPU (3.6 GHz). It is true that the GT 705 throughput is only about 10 GB/s, as shown by the output of the transpose CUDA Toolkit sample below. But I eventually want to run the CUDA code on the TK1 platform, whose throughput is also about 10 GB/s (after changing the GPU frequency to 852000 Hz). So do you mean that I cannot optimize the optical flow any further with opencv-gpu?

In addition, I have implemented some kernel functions without using opencv-gpu, and those kernel functions run faster than the CPU versions. So I am wondering whether the TK1 platform is simply not powerful enough to run the opencv-gpu optical flow function.

Sorry, I am new to CUDA programming. I want to optimize the calcOpticalFlowFarneback function on the TK1 platform, because it takes more than 300 ms per video frame (640x480). Is it possible to optimize the time down to 30 ms?

thanks in advance


C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>transpose.exe
Transpose Starting...

GPU Device 0: "GeForce GT 705" with compute capability 2.1

> Device 0: "GeForce GT 705"
> SM Capability 2.1 detected:
> [GeForce GT 705] has 1 MP(s) x 48 (Cores/MP) = 48 (Cores)
> Compute performance scaling factor = 4.00

Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 10.4264 GB/s, Time = 0.18732 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 9.3503 GB/s, Time = 0.20888 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 5.8396 GB/s, Time = 0.33446 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 7.9239 GB/s, Time = 0.24649 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 8.8558 GB/s, Time = 0.22055 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 9.0476 GB/s, Time = 0.21587 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 8.9535 GB/s, Time = 0.21814 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 6.7352 GB/s, Time = 0.28999 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

The following is the output of transpose on the TK1 platform:

..../cuda-6.5/samples/6_Advanced/transpose$ sudo ./transpose
Transpose Starting...

modprobe: FATAL: Module nvidia not found.
GPU Device 0: "GK20A" with compute capability 3.2

> Device 0: "GK20A"
> SM Capability 3.2 detected:
> [GK20A] has 1 MP(s) x 192 (Cores/MP) = 192 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 9.9408 GB/s, Time = 0.78591 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 9.2726 GB/s, Time = 0.84254 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 6.5784 GB/s, Time = 1.18759 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 6.2635 GB/s, Time = 1.24730 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 6.2606 GB/s, Time = 1.24788 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 6.2304 GB/s, Time = 1.25392 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 8.6643 GB/s, Time = 0.90168 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 5.5581 GB/s, Time = 1.40562 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

The i7-4790 @ 3.6 GHz is near the upper end of the CPU performance spectrum, while the GeForce GT 705 is at the very lowest end of the GPU performance spectrum. The performance characteristics of particular functions in OpenCV may be better suited to CPU performance characteristics and CPU-style parallelism, while other functions are a better match for GPU-style parallelism.

I have no insight into the amount of optimization work that has gone into the respective CPU and GPU variants of OpenCV. I could imagine that it varies considerably between particular functions. When you compare against your own custom version, the performance differences observed could be due to mediocre optimization in OpenCV, or due to the fact that your custom version only has to support a subset of the functionality (or both). I can’t tell which scenario applies here.

Keep in mind that the CPU inside the TK1 is not nearly as powerful as an i7-4790, so shifting the work to the GPU does seem like the correct approach for that platform. The TK1 is an integrated, low-power embedded platform; comparing its performance to a high-power PC with a discrete graphics card does not really make sense to me. I would suggest focusing performance work on the TK1 sooner rather than later, to get a better feel for the performance bottlenecks on the intended target platform.

You might need a faster GPU optical flow method (a different approach than Farneback). The ‘Folki’ optical flow method (http://ieeexplore.ieee.org/document/1529706/) seems to be really fast. See http://www.onera.fr/en/node/1347?page=1 and http://www.nvidia.com/content/GTC/posters/23_Plyer_FOLKI_GPU.pdf . I think one can download the CUDA code from the first link.
Another possibility would be to calculate the optical flow at half resolution (320x240) and scale it up.

Hi guys, thanks for your answers!

I have tried running the optical flow code with opencv-gpu on the TK1 platform. It is slow: more than 300 ms.

If I run the optical flow code without the opencv GPU version, it also takes about 300 ms.

This result is unexpected: on the TK1 platform, the GPU version runs no faster (even slower) than the CPU version.

I am also wondering why the GeForce GT 705 on the PC platform is more powerful than the GPU on the TK1 platform, because the optical flow takes only about 120 ms with the GeForce GT 705 on the PC!

Is there anything I missed on the TK1?

Thanks, hannesf99, I will have a look at the ‘Folki’ optical flow method you referred to.

thanks in advance!


Farn : 0.319058 sec
GetOptFlow_gpu,time spent executing by the GPU: 347.15,by CPU in CUDA calls: 347.42,CPU executed 41 iterations while waiting for GPU to finish

The TK1 and the GT 705 use different architectures. I would assume they also run their cores at different operating frequencies, but I have not checked. So I don’t think we can make any straightforward comparison of their computational throughput; it is more of an apples-to-oranges comparison. This goes to my point that performance considerations are best based on experiments on the TK1 itself, rather than on a stand-in.

As far as the available memory bandwidth is concerned, both GPUs seem to perform quite similarly based on the data from the transpose benchmark in #4. That is probably due to the fact that both use DDR3-based memory subsystems.

thank you, njuffa,

I understand your point that it doesn’t make sense to compare the GPU on the TK1 with the GPU on the PC platform.

But I really want to find out how to improve the performance of the OpenCV optical flow on the TK1 with the GPU. I think 300 ms is too slow, especially since it is no better than without the GPU (it also takes about 300 ms to run the optical flow without the GPU version).

I just want to confirm whether it is normal for the optical flow (GPU version) call to take this long on the TK1, or whether there is something I missed.

In addition, I found that the second call takes only about 100 ms, as shown below. I also noticed that it runs a little faster today (200 ms) than yesterday (300 ms).


RadioClassifier_gpu,time spent executing by the GPU: 7.66,by CPU in CUDA calls: 7.72,CPU executed 51 iterations while waiting for GPU to finish
Farn : 0.220757 sec
Farn : 0.108355 sec
Farn : 0.0879475 sec

Is it related to the PTX code? Can I avoid warming up the GPU by compiling the CUDA code to native code instead of PTX?

Also, is there any other optical flow method with higher performance? The link that hannesf99 mentioned (#6) doesn’t include GPU source code.

In addition, what is the difference between the optical flow samples in the OpenCV source code? Is opticalflow_nvidia_api faster than optical_flow?

samples/gpu/opticalflow_nvidia_api.cpp
samples/gpu/optical_flow.cpp

Sorry for my poor English; I hope you can understand my concerns.

thanks a lot
-zhi

I have never worked with OpenCV or the TK1, so I do not have any tips on how one might improve performance (for example, by choosing different functions from the ones you are using, as HannesF99 suggested above).

The CPU and GPU in a TK1 share the same physical DRAM, so if the task at hand is memory bound (something you should be able to confirm/refute with the help of the profiler), it seems entirely plausible that the performance for OpenCV’s CPU and GPU variants of that task would be approximately the same.

A “magical” speedup observed after the first iteration may point to a flawed performance-measurement methodology, in that one-time startup overhead (e.g. CUDA context creation, PTX JIT compilation) gets included in the measurements, which should be avoided. If you cannot change the measurement framework, focus on steady-state performance after a warmup phase. To avoid PTX JIT compilation make sure that all CUDA code is compiled with the -arch flag appropriate for the TK1. I am reasonably sure the TK1 has compute capability 3.2, so use compute_32, sm_32.

BTW, there is a dedicated TK1 forum “next door” (https://devtalk.nvidia.com/default/board/162/jetson-tk1/), your chances of getting advice on performance tuning for that specific platform are probably better over there.

The download link is here: http://www.onera.fr/en/node/1343 (application plus flow code in one zip).
The popular flow algorithms (TV-L1, Brox, Lucas-Kanade, …) are supposedly memory bound, because typically only a few arithmetic operations are done per pixel. To improve performance further, you should also try to optimize the parametrization (fewer iterations, …) and calculate the flow at the lowest resolution that still gives good results for your application.

hi hannesf99,

I have downloaded the zip from the link, but it only contains:

FOLKI_SPIV_W7_CUDA3.1_demo.msi
setup_W7_CUDA3.1_demo.exe

After running FOLKI_SPIV_W7_CUDA3.1_demo.msi, some binary components are installed under C:\Program Files\Onera\Folki_W7_CUDA3.1:

CUDA_AFIX.dll
CUDA_DAFE.dll
Folki_PIV.exe

But there is no CUDA source code that I can find.

OK, I see. I downloaded it 5 years ago, and at that time the source code was included. I sent you a private message with my email address. Please reply so that I have your email, and I will send you the package as I originally downloaded it.

hi HannesF99,

thank you again!

I just got the CUDA code you emailed and will study it.

In addition, after changing the GPU frequency on the TK1 board from 72000 kHz to 852000 kHz, the optical flow GPU version takes about 130 ms; it was about 300 ms before.

But 130 ms is still too slow for a real-time system on the TK1, so I will have a look at the Folki method; according to its paper, it takes only 30 ms per frame.