GPU-based endoscope feasibility

Hi,

We are currently trying to work out whether it is feasible to build a GPU-based endoscope. The sensing part is a 1080p, 60 fps CMOS camera with a MIPI CSI-2 interface. We have to run a set of custom image-processing algorithms for image enhancement and push the result to the HDMI interface for display. This whole set of operations needs to be completed in under one frame of latency, i.e. less than 16.6 ms. To get an understanding of the time taken by the image-processing algorithms, we ran an implementation written with the OpenCV C++ libraries on an i7 4510 quad-core processor running at 2 GHz under Ubuntu 14.04. The algorithm took around 300 milliseconds per image on average. Given below are the timings of the different stages of the algorithm for one iteration:
Luminance time: 0.0195284 seconds
Log luminance time: 0.00767123 seconds
Bilateral filter time: 0.245126 seconds
Antilog / base image time: 0.0022362 seconds
Detail image time: 0.00572883 seconds
Enhancement time: 0.00536 seconds
Combined image time: 0.0041545 seconds
Getting coefficients time: 0.00538347 seconds
Getting output time: 0.0200575 seconds

Total time: 0.315247 seconds

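For context, these numbers come from simply wrapping each stage with a wall-clock timer around the OpenCV calls. A simplified sketch of that kind of harness is shown below; the real enhancement stages are omitted and the filter parameters are placeholders, not our actual settings.

```cpp
// Simplified timing harness (placeholder parameters, not our production settings).
#include <opencv2/opencv.hpp>
#include <chrono>
#include <iostream>

int main()
{
    cv::Mat frame = cv::imread("frame_1080p.png");      // placeholder input frame
    if (frame.empty()) return 1;

    cv::Mat lum, logLum, base;
    cv::cvtColor(frame, lum, cv::COLOR_BGR2GRAY);
    lum.convertTo(lum, CV_32F, 1.0 / 255.0);

    auto t0 = std::chrono::steady_clock::now();
    cv::log(lum + 1e-4f, logLum);                        // log-luminance stage
    auto t1 = std::chrono::steady_clock::now();
    cv::bilateralFilter(logLum, base, 9, 0.1, 16.0);     // dominant stage in our profile
    auto t2 = std::chrono::steady_clock::now();

    auto secs = [](auto a, auto b) {
        return std::chrono::duration<double>(b - a).count();
    };
    std::cout << "Log luminance time: "    << secs(t0, t1) << " seconds\n";
    std::cout << "Bilateral filter time: " << secs(t1, t2) << " seconds\n";
    return 0;
}
```
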
We were looking at the option of using the TX1 for this application, basically because it supports the camera and display interfaces we require. Before we go ahead, buy the eval board, and try this, I wanted to check if this is indeed a feasible option.

I am a newbie with GPUs, and we usually go for an FPGA-based implementation in such cases. I was looking at the capabilities of these new GPUs and was wondering why not use them instead.

Any help will be highly appreciated.

First let me say I am a newbie myself and am just learning about the TX1.
But I must first ask: "Where do you want to shove a Jetson TX1?"
Sorry, I could not help myself.

Looking at your i7 4510 result of 300 ms/frame, you need roughly a 20X speedup.
However, I don't know how relevant that i7 benchmark really is, because these kinds of
algorithms can be drastically affected by how well the benchmark code is tuned; in other words,
you may have a lot of room for improvement even before you move to the GPU.

If you are reluctant to spend the $600 for a Jetson TX1 developer kit, consider benchmarking on a low-end Maxwell-class NVIDIA graphics card instead.

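Whatever card you end up benchmarking on, it is worth confirming its clock, SM count and memory bandwidth with a quick cudaGetDeviceProperties query so you can compare it against the TX1 numbers below (CUDA cores per SM depend on the architecture, e.g. 128 for Maxwell, so core count = SMs x cores-per-SM). A minimal sketch; the 2x factor in the bandwidth estimate assumes DDR-style memory:

```cuda
// Minimal sketch: report the specs of GPU device 0 via cudaGetDeviceProperties.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical peak memory bandwidth = 2 (DDR) * memory clock * bus width in bytes.
    double bwGBs = 2.0 * (prop.memoryClockRate * 1e3)        // kHz -> Hz
                       * (prop.memoryBusWidth / 8.0) / 1e9;  // bits -> bytes, B/s -> GB/s

    printf("Device        : %s\n", prop.name);
    printf("SM count      : %d\n", prop.multiProcessorCount);
    printf("GPU clock     : %.2f GHz\n", prop.clockRate * 1e3 / 1e9);
    printf("Mem bandwidth : %.1f GB/s (theoretical peak)\n", bwGBs);
    return 0;
}
```
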
The GTX 950 cards run a slightly faster clock than the TX1 GPU
(~1.2 GHz boost clock on the GTX 950 vs. ~1.0 GHz on the TX1),
have 3X the CUDA cores
(768 on the GTX 950 vs. 256 on the TX1),
and have roughly 4X the memory bandwidth
(~105 GB/s on the GTX 950 vs. ~26 GB/s on the TX1).

So I would expect a GPU benchmark on the GTX 950 to run almost 4 times (4X) as fast as on the Jetson TX1.
If you port your program to the GTX 950 and see it running in under 4 ms/frame, then I would expect the Jetson TX1 to meet your 16.6 ms frame-time requirement.
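As a concrete first step for that port: your own timings show the bilateral filter alone is ~245 ms of the ~315 ms total, and OpenCV's CUDA module (cudaimgproc) already provides cv::cuda::bilateralFilter, so you can time just that stage on the GPU before porting anything else. A rough sketch, with placeholder filter parameters rather than anything tuned for your images:

```cpp
// Rough sketch: time the dominant stage (bilateral filter) on the GPU with
// OpenCV's CUDA module. Parameters are placeholders, not tuned values.
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <chrono>
#include <iostream>

int main()
{
    // Placeholder 1080p single-channel float input (stand-in for a log-luminance image).
    cv::Mat lum(1080, 1920, CV_32F);
    cv::randu(lum, cv::Scalar::all(0), cv::Scalar::all(1));

    cv::cuda::GpuMat dIn, dOut;
    dIn.upload(lum);

    // The first call pays one-time CUDA context / kernel-load cost, so do a warm-up run.
    cv::cuda::bilateralFilter(dIn, dOut, 9, 0.1f, 16.0f);

    auto t0 = std::chrono::steady_clock::now();
    cv::cuda::bilateralFilter(dIn, dOut, 9, 0.1f, 16.0f);
    cv::cuda::Stream::Null().waitForCompletion();   // make sure the kernel has finished
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "GPU bilateral filter: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
    return 0;
}
```
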

However, if your first run of the benchmark code on the GTX 950 is not fast enough, be aware that GPU code can easily leave a lot of room for performance improvement through tuning and optimization.

Mark Harris of NVIDIA wrote a wonderful presentation, "Optimizing Parallel Reduction in CUDA":
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
in which he does a step-by-step optimization of a function that computes the sum of a list (array) of numbers.
He starts with a straightforward initial kernel and then refines (tunes/optimizes) the code six times, gaining performance at each step; in the end those six rounds of improvement give roughly a 30X speedup. My point is that it would be easy for your first GPU benchmark to come out looking too slow, when some code optimization could still get you to a workable solution.
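To give a flavour of what that presentation walks through, below is a small shared-memory reduction kernel roughly in the spirit of one of his intermediate versions (this is my own sketch, not code from the slides):

```cuda
// Sketch of a shared-memory parallel reduction with sequential addressing.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void reduceSum(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (or 0 if past the end) into shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction with sequential addressing: halve the active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; the host adds the partials below.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    std::vector<float> h(n, 1.0f);            // sum should come out to n
    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduceSum<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);

    std::vector<float> partials(blocks);
    cudaMemcpy(partials.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    double sum = 0.0;
    for (float p : partials) sum += p;
    printf("sum = %.0f (expected %d)\n", sum, n);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```
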

Also, the effort of porting this benchmark to the GTX 950 will be very transferable experience when you then attempt to port the code to the Jetson TX1.

Chuck