Latency of a GPU-implemented algorithm

I’m an FPGA designer with C experience but no GPU / CUDA experience. I need to implement a dictionary-based image correction algorithm. It’s very computation heavy, and the #1 concern is latency, i.e. the time the first processed pixel “leaves” the machine minus the time the corresponding unprocessed pixel “entered” it. The target is < 40 ms.

The algorithm maps well to an FPGA (and will definitely meet the latency requirement if all the processing is done on the FPGA), but I’m also considering a mixed FPGA & GPU design to shorten the development cycle. In the mixed FPGA / GPU design the FPGA won’t do any processing, only handle data transport.

I’m asking the experts here for a VERY ROUGH ESTIMATE of the latency (in milliseconds) if the processing part of the algorithm is implemented on a GeForce GTX 1080.

This is how it works:

Pixels arrive at the FPGA in raster scan order.
The FPGA packetizes them and sends them to the host over PCIe.
The host and the GPU have to do the following:
Divide the image into 7*7-pixel tiles, where each tile overlaps its neighbor by 42 pixels (7 * 7 - 7 = 42, i.e. the tile position advances by a single pixel). Think of it as a tile raster scan.
For EVERY TILE, do 256 convolutions (each of the 256 convolutions uses a different set of coefficients; each set is essentially a dictionary element).
Compare all 256 convolution results and return the INDEX (a number from 0 to 255 that identifies the dictionary element) that gave the highest result.
Supplement the original pixel with this dictionary element and return the result over PCIe to the FPGA. (A rough sketch of how I picture the GPU part follows below.)
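
To make the question concrete, here is roughly how I picture the GPU side of the matching step. I have no CUDA experience, so please treat this strictly as a strawman sketch: the names, the float data type, the one-thread-per-tile mapping and the constant-memory dictionary are my assumptions, and error checking is left out.

```cuda
// Strawman only: one thread per tile anchor, dictionary in constant memory,
// output = best dictionary index per anchor. Error checking omitted.
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdio>
#include <vector>

#define TILE     7
#define DICT_N   256
#define DICT_LEN (TILE * TILE)            // 49 coefficients per dictionary element

// 256 * 49 floats = ~50 KB, which fits in the 64 KB constant memory space.
__constant__ float c_dict[DICT_N * DICT_LEN];

__global__ void dict_match(const float* __restrict__ img,
                           unsigned char* __restrict__ best_idx,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // tile top-left column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // tile top-left row
    if (x > width - TILE || y > height - TILE) return;

    // Read the 7*7 neighbourhood once; it is reused by all 256 dot products.
    float tile[DICT_LEN];
    for (int ty = 0; ty < TILE; ++ty)
        for (int tx = 0; tx < TILE; ++tx)
            tile[ty * TILE + tx] = img[(y + ty) * width + (x + tx)];

    // 256 convolutions (dot products) against the dictionary, keep the argmax.
    float best = -FLT_MAX;
    int   idx  = 0;
    for (int d = 0; d < DICT_N; ++d) {
        float acc = 0.0f;
        for (int k = 0; k < DICT_LEN; ++k)
            acc += tile[k] * c_dict[d * DICT_LEN + k];
        if (acc > best) { best = acc; idx = d; }
    }
    best_idx[y * width + x] = (unsigned char)idx;
}

int main()
{
    const int W = 1280, H = 1280;
    std::vector<float>         h_img(W * H, 0.0f);                // placeholder frame
    std::vector<float>         h_dict(DICT_N * DICT_LEN, 0.0f);   // placeholder dictionary
    std::vector<unsigned char> h_idx(W * H, 0);

    float *d_img; unsigned char *d_idx;
    cudaMalloc(&d_img, W * H * sizeof(float));
    cudaMalloc(&d_idx, W * H);
    cudaMemcpyToSymbol(c_dict, h_dict.data(), sizeof(float) * DICT_N * DICT_LEN);

    // Per frame: copy in, match, copy out. In the real system I suppose these
    // would be async copies on streams, overlapped with the FPGA <-> host traffic.
    cudaMemcpy(d_img, h_img.data(), W * H * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(16, 16);
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    dict_match<<<grid, block>>>(d_img, d_idx, W, H);
    cudaMemcpy(h_idx.data(), d_idx, W * H, cudaMemcpyDeviceToHost);

    printf("best index at tile (0,0): %d\n", h_idx[0]);
    cudaFree(d_img); cudaFree(d_idx);
    return 0;
}
```

I assume a tuned version would reuse the overlapping neighbourhoods through shared memory and pipeline the PCIe transfers against the kernel in slices (to get the latency below one frame time), but I have no feel for how much that is worth in practice.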

Each frame is 1280 * 1280 and the frame rate is 30 FPS. According to my calculation there are ~400 million 7*7 convolutions to do on each frame and ~1.6 million compensations (one per tile position).
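
Spelled out, and ignoring the border so that every pixel position is a tile anchor:

1280 * 1280 = ~1.6 million tile positions per frame
1.6 million * 256 = ~420 million 7*7 convolutions per frame
420 million * 49 MACs = ~20 GMAC per frame, i.e. ~620 GMAC/s at 30 FPS

Counting a MAC as two operations, that is on the order of 1.2 TFLOP/s of useful arithmetic, against the roughly 9 TFLOP/s FP32 peak quoted for the GTX 1080 - but I have no idea what fraction of that peak is achievable for this access pattern, or what the PCIe transfers add.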

Do you think the GeForce GTX 1080 can handle the task in terms of throughput? If it can, do you think it’ll meet the < 40 ms latency requirement?