I have an image processing application that needs to process many incoming images of varying sizes. The hardware is multiple Tesla T4s. Currently, we have one consumer thread per device and process the images in parallel across the devices. However, some tools suggest that we are not fully utilizing the GPUs. So I tried giving each device multiple command queues (I’m using OpenCL) so that each GPU can process multiple images in parallel and potentially hide memory transfers to and from the GPU. Performance is actually worse when I do this, though.
Does the Tesla T4 support parallel dispatch? Is using multiple command queues per device the wrong way to go? I know I’m using OpenCL and not CUDA, but hopefully the strategy would be similar.
Would it be better to just use two queues per device: a copy queue and a worker queue?
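To make concrete what I am hoping the two-queue setup would buy me, here is a toy timing model. It is plain C, not OpenCL host code, and the stage times are made-up numbers, not measurements from a T4. The names `StageTimes`, `serialized_ms` and `overlapped_ms` are hypothetical. It assumes that the device can overlap a transfer with a kernel at all, and that uploads and downloads share one in-order copy queue while kernels run on a separate worker queue.

```c
#include <assert.h>

/* Hypothetical per-image stage times in milliseconds (made-up values). */
typedef struct {
    double upload_ms;   /* host -> device copy */
    double kernel_ms;   /* processing kernel   */
    double download_ms; /* device -> host copy */
} StageTimes;

static double max_d(double a, double b) { return a > b ? a : b; }

/* One in-order queue: upload, kernel and download of every image
   run strictly back to back, so nothing overlaps. */
double serialized_ms(StageTimes t, int n_images) {
    assert(n_images >= 0);
    return n_images * (t.upload_ms + t.kernel_ms + t.download_ms);
}

/* Copy queue + worker queue. The copy queue is in-order and receives
   U0, U1, D0, U2, D1, ..., D(n-1). Kernel i waits for upload i, and
   download i waits for kernel i (as event dependencies would enforce). */
double overlapped_ms(StageTimes t, int n_images) {
    assert(n_images >= 0);
    double copy_free = 0.0;        /* time at which the copy queue is idle   */
    double compute_free = 0.0;     /* time at which the worker queue is idle */
    double prev_kernel_done = 0.0; /* finish time of the previous kernel     */

    for (int i = 0; i < n_images; ++i) {
        /* Upload image i on the copy queue. */
        double upload_done = copy_free + t.upload_ms;
        copy_free = upload_done;

        /* Kernel i starts once its upload is done and the worker is idle. */
        double kernel_done = max_d(compute_free, upload_done) + t.kernel_ms;
        compute_free = kernel_done;

        /* Download of image i-1 is queued behind upload i. */
        if (i > 0) {
            copy_free = max_d(copy_free, prev_kernel_done) + t.download_ms;
        }
        prev_kernel_done = kernel_done;
    }

    /* Download of the last image. */
    if (n_images > 0) {
        copy_free = max_d(copy_free, prev_kernel_done) + t.download_ms;
    }
    return copy_free;
}
```

With the made-up times `{2.0, 5.0, 2.0}` and 4 images, `serialized_ms` gives 36 ms and `overlapped_ms` gives 24 ms, so in this idealized model the kernel time dominates and the transfers are almost fully hidden. That is the effect I expected from extra queues, which is why the slowdown I see in practice surprises me.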
Thanks in advance.