Can't replicate reported speed on GeForce RTX 2080 Ti

I tried to run the sample AppOFCuda program with my 720p video, but the frame rate I measured only reached 70 FPS. I used the fast preset and disabled the output download. This result is far below the reported speed of 120 FPS on a 4K video. Is there anything I should configure to make it match the reported speed?

In my experience, the file-reading implementation in the sample is quite slow. I got some speedup when I switched to a different library to read the images. But I’ve also struggled to get anywhere near the performance reported for this technology.

Thanks for your reply, Edward! I’ll try using OpenCV for reading images. Also, can any of the NVIDIA developers respond to this please?

Hi.
The samples shipping in the Optical Flow SDK are not optimized for performance. We suggest you follow the guidelines mentioned in chapter 8 of NVOFA_Programming_Guide.pdf to achieve better performance. In particular, try to follow suggestions #2 and #5.

We also plan to add a few samples optimized for performance in one of our upcoming SDK releases.

Thanks!

Hi, thanks for your reply.
Specifying the flag --useCudaStream=1 for the sample program should enable different streams for input and output processing, right?
Also, the guide is ambiguous: how should I load frames into a buffer of size 4? Should I copy the current frame to the next buffer so that I can use it as the reference for the next iteration? Is that what it means by not reusing the buffers? I tried it roughly as sketched below but got no speedup.
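
In case it helps clarify the question, here is roughly what I tried. This is only a sketch: nvOFExecutePlaceholder stands in for the actual SDK execute call, and hostFrames are frames I already decoded into system memory.

```cpp
#include <cuda_runtime.h>

constexpr int kRing = 4;

// Placeholder for the actual optical-flow execute call; not an SDK function.
void nvOFExecutePlaceholder(void* refFrame, void* curFrame);

void ProcessSequence(int numFrames, size_t frameBytes,
                     unsigned char* const* hostFrames,
                     cudaStream_t inputStream)
{
    void* devFrames[kRing] = {};
    for (int i = 0; i < kRing; ++i)
        cudaMalloc(&devFrames[i], frameBytes);

    // Upload frame 0 once; it becomes the reference for the first execution.
    cudaMemcpyAsync(devFrames[0], hostFrames[0], frameBytes,
                    cudaMemcpyHostToDevice, inputStream);

    for (int i = 1; i < numFrames; ++i) {
        void* ref = devFrames[(i - 1) % kRing];  // uploaded on the previous iteration
        void* cur = devFrames[i % kRing];        // slot is reused every 4 frames
        cudaMemcpyAsync(cur, hostFrames[i], frameBytes,
                        cudaMemcpyHostToDevice, inputStream);
        nvOFExecutePlaceholder(ref, cur);        // current frame becomes the next reference
    }

    cudaStreamSynchronize(inputStream);
    for (int i = 0; i < kRing; ++i)
        cudaFree(devFrames[i]);
}
```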

Hi.

Sorry for the slow response.
If you are using the SDK sample application, switching to a different library to load the PNGs won’t help much in achieving high performance.

You are likely following these steps to measure the performance:

1.	Start timer
2.	File-read to sysmem
3.	Sysmem to vidmem upload
4.	Start hw
5.	Vidmem to sysmem download
6.	Sysmem to file write
7.	Stop timer
8.	Go to step 1

The above steps break the pipeline, even with different CUDA streams.
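
A minimal sketch of that per-frame pattern is below. ReadFrameFromFile, WriteFlowToFile and RunOF are placeholders rather than actual SDK calls; the point is that every timed iteration serializes disk I/O, both PCIe transfers and the HW execution.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <cuda_runtime.h>

// Placeholders for the sample's file I/O and optical-flow execution; not SDK calls.
void ReadFrameFromFile(int idx, unsigned char* dst);
void WriteFlowToFile(int idx, const unsigned char* src);
void RunOF(void* refFrame, void* curFrame, void* flowOut);

double MeasureNaive(int numFrames, size_t frameBytes, size_t flowBytes)
{
    unsigned char* hostFrame = new unsigned char[frameBytes];
    unsigned char* hostFlow  = new unsigned char[flowBytes];
    void *devRef = nullptr, *devCur = nullptr, *devFlow = nullptr;
    cudaMalloc(&devRef, frameBytes);
    cudaMalloc(&devCur, frameBytes);
    cudaMalloc(&devFlow, flowBytes);
    ReadFrameFromFile(0, hostFrame);
    cudaMemcpy(devRef, hostFrame, frameBytes, cudaMemcpyHostToDevice);

    double elapsed = 0.0;
    for (int i = 1; i < numFrames; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();  // 1. start timer
        ReadFrameFromFile(i, hostFrame);                       // 2. file read to sysmem
        cudaMemcpy(devCur, hostFrame, frameBytes,
                   cudaMemcpyHostToDevice);                    // 3. sysmem -> vidmem
        RunOF(devRef, devCur, devFlow);                        // 4. start hw
        cudaMemcpy(hostFlow, devFlow, flowBytes,
                   cudaMemcpyDeviceToHost);                    // 5. vidmem -> sysmem
        WriteFlowToFile(i, hostFlow);                          // 6. sysmem to file write
        auto t1 = std::chrono::high_resolution_clock::now();  // 7. stop timer
        elapsed += std::chrono::duration<double>(t1 - t0).count();
        std::swap(devRef, devCur);                             // 8. go to the next pair
    }

    cudaFree(devRef); cudaFree(devCur); cudaFree(devFlow);
    delete[] hostFrame; delete[] hostFlow;
    return elapsed;
}
```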

For performance measurement, you need to do something like the following
(assume you have N executions in a batch):

1.	Allocate N+1 input and N output buffers
2.	Read N+1 frames to sysmem
3.	Upload sysmem to vidmem
4.	Cuda input stream sync (block the CPU until above operations are completed)
5.	Start timer
6.	Kickoff N executions
7.	Cuda output stream sync (block the CPU until hw is done)
8.	Stop timer
9.	Download output from vidmem to sysmem
10.	Sysmem to file write
11.	Go to step 1

The idea here is to either minimize or pipeline the PCIe transfers of raw images, since they require large bandwidth and become a bottleneck to achieving high performance.
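
A rough sketch of that batched measurement is below. RunOFAsync is a placeholder for the actual execute call and is assumed to enqueue its work on the input/output CUDA streams associated with the OF session, so the stream synchronizations cover it; the N+1 frames are assumed to be already read into system memory.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for one HW execution; not an SDK function.
void RunOFAsync(void* refFrame, void* curFrame, void* flowOut);

double MeasureBatch(int N, size_t frameBytes, size_t flowBytes,
                    unsigned char* const* hostFrames,   // N+1 frames already in sysmem
                    cudaStream_t inputStream, cudaStream_t outputStream)
{
    // 1. Allocate N+1 input and N output buffers in vidmem.
    std::vector<void*> devIn(N + 1), devOut(N);
    for (int i = 0; i <= N; ++i) cudaMalloc(&devIn[i], frameBytes);
    for (int i = 0; i <  N; ++i) cudaMalloc(&devOut[i], flowBytes);

    // 2./3. Upload all N+1 frames to vidmem (file reads happened beforehand).
    for (int i = 0; i <= N; ++i)
        cudaMemcpyAsync(devIn[i], hostFrames[i], frameBytes,
                        cudaMemcpyHostToDevice, inputStream);

    // 4. Block the CPU until all uploads are complete.
    cudaStreamSynchronize(inputStream);

    // 5.-8. Time only the N HW executions.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        RunOFAsync(devIn[i], devIn[i + 1], devOut[i]);  // 6. kick off execution i
    cudaStreamSynchronize(outputStream);                 // 7. wait until the HW is done
    auto t1 = std::chrono::high_resolution_clock::now();

    // 9./10. Download the flow output to sysmem and write it to files here,
    // outside the timed region (omitted in this sketch).

    for (void* p : devIn)  cudaFree(p);
    for (void* p : devOut) cudaFree(p);
    return std::chrono::duration<double>(t1 - t0).count();
}
```

The frame rate is then N divided by the measured time, which reflects only the HW executions rather than file I/O and PCIe transfers.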

Hope this helps.

Thanks.