I have a fairly classic UNet network that was trained in TensorFlow, then converted to Core ML format for inference on Mac hardware and to ONNX format for Windows.
The ONNX network has been converted to half-float (FP16) precision.
The inference code uses the high-level APIs: EvaluateAsync plus custom tensorization on WinML, and the auto-generated Core ML code on the Mac.
The Mac is from 2016, with a Radeon 575 with 4 GB of video memory, running Mojave. Inference on a 12 Mpix 16-bit unsigned image takes 4 s, and on an 80 Mpix 16-bit unsigned image 20 s.
The PC has an Intel 9700K and an RTX 2080 Ti with 11 GB of VRAM, running Windows 10 1903 with drivers 440.97. Inference on the same 12 Mpix image takes 6 s, and on the same 80 Mpix image 26 s.
I think I did something wrong, since a mid-range GPU from three years ago is faster than a high-end GPU from this year, but I don't know what. Is there a way to test an ONNX model outside WinML, using an NVIDIA-optimised framework? Or is there an explanation or something I'm missing?
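For reference, here is a minimal sketch of how I could benchmark the same ONNX model outside WinML, using onnxruntime with its CUDA execution provider (the model path and input shape below are placeholders, not my actual values; requires the `onnxruntime-gpu` package):

```python
import time

def time_inference(run_fn, feed, warmup=2, iters=5):
    """Time run_fn(feed): do a few warmup runs (to exclude lazy
    initialization and kernel compilation), then return the mean
    wall-clock seconds over the timed iterations."""
    for _ in range(warmup):
        run_fn(feed)
    t0 = time.perf_counter()
    for _ in range(iters):
        run_fn(feed)
    return (time.perf_counter() - t0) / iters

# Usage sketch with onnxruntime (placeholder model path and shape):
# import numpy as np
# import onnxruntime as ort
# sess = ort.InferenceSession("unet_fp16.onnx",
#                             providers=["CUDAExecutionProvider"])
# inp = sess.get_inputs()[0]
# x = np.random.rand(1, 3, 2048, 2048).astype(np.float16)  # assumed shape
# mean_s = time_inference(lambda f: sess.run(None, f), {inp.name: x})
# print(f"mean inference time: {mean_s:.2f} s")
```

Timing with warmup runs matters here, since the first EvaluateAsync/session run often includes one-off graph optimization and GPU memory allocation that would skew a single-shot measurement.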