Increasing kernel dispatch overhead for multiple GPUs

Hi all,

I am using the following Amazon AWS instances to test an OpenCL-based program:
1- p3.2xlarge containing 1 NVIDIA Tesla V100 GPU
1- p3.16xlarge containing 8 NVIDIA Tesla V100 GPUs

With a single GPU, the program works smoothly as expected.
However, the program runs slower and slower as i increase the number of GPUs !
I did some profiling and measurement of each Opencl-API-call, as well as, the execution times of the kernel functions.
The bottleneck is found to be the call to the (enqueueNDRangeKernel), C++ binding function, which is simply wrapper around the OpenCL C function (clEnqueueNDRangeKernel).

With one GPU device, the call to (enqueueNDRangeKernel) is acceptable (less than one microsecond).
However, it increases more and more as I increase the number of devices.
With 8 devices, the call overhead is around 90 milliseconds !
Here are the technical details:

0- The Amazon machines come with NVIDIA drivers version 9.0.176 and run on Amazon Linux (Linux Kernel 4.14).
1- I create a single OpenCL context to group all devices in one context.
2- For each device, I create a command queue with in-order execution.
2- I create buffers in the context to be shared among the devices.
3- For each opencl-kernel-function, and for each command-queue, I create an OpenCl-kernel and assign its arguments using sub-buffers of the main buffer.
4- My program calls (enqueueNDRangeKernel) several times to enqueue all kernels to the command-queues/devices and at this point the bottleneck is observed.
5- My program flushes (calls the C++ wrapper of clFlush()) the command-queues. No bottleneck is observed here.
6- My program waits for all kernels to finish (calls the C++ wrapper of clFinish()). Kernels were found to run with predictable execution time at this point.

Notes:

  • I measure the delay of the call to (clEnqueueNDRangeKernel) from the host side (i.e. C++ time measurement functions).
  • I measure the execution times of the kernels by the OpenCL API (events + getProfilingInfo() function)

A similar thread is here (https://devtalk.nvidia.com/default/topic/1001155/dispatch-kernel-overhead-opencl-/) but it is for one GPU and the cause was probably from NVIDIA Windows drivers.
Also, a similar one here https://devtalk.nvidia.com/default/topic/466227/?comment=3312620 for a single GPU and no one replied to the poor guy.

I’d appreciate if some one can help or answer the following questions:

  • Has any one faced a similar behavior ?
  • Can it be a driver issue ?
  • Any possible work-around ?

Regards,