Increasing kernel dispatch overhead for multiple GPUs

mkhaled0ohsr · July 30, 2018, 3:46am

Hi all,

I am using the following Amazon AWS instances to test an OpenCL-based program:
1- p3.2xlarge containing 1 NVIDIA Tesla V100 GPU
1- p3.16xlarge containing 8 NVIDIA Tesla V100 GPUs

With a single GPU, the program works smoothly as expected.
However, the program runs slower and slower as i increase the number of GPUs !
I did some profiling and measurement of each Opencl-API-call, as well as, the execution times of the kernel functions.
The bottleneck is found to be the call to the (enqueueNDRangeKernel), C++ binding function, which is simply wrapper around the OpenCL C function (clEnqueueNDRangeKernel).

With one GPU device, the call to (enqueueNDRangeKernel) is acceptable (less than one microsecond).
However, it increases more and more as I increase the number of devices.
With 8 devices, the call overhead is around 90 milliseconds !
Here are the technical details:

0- The Amazon machines come with NVIDIA drivers version 9.0.176 and run on Amazon Linux (Linux Kernel 4.14).
1- I create a single OpenCL context to group all devices in one context.
2- For each device, I create a command queue with in-order execution.
2- I create buffers in the context to be shared among the devices.
3- For each opencl-kernel-function, and for each command-queue, I create an OpenCl-kernel and assign its arguments using sub-buffers of the main buffer.
4- My program calls (enqueueNDRangeKernel) several times to enqueue all kernels to the command-queues/devices and at this point the bottleneck is observed.
5- My program flushes (calls the C++ wrapper of clFlush()) the command-queues. No bottleneck is observed here.
6- My program waits for all kernels to finish (calls the C++ wrapper of clFinish()). Kernels were found to run with predictable execution time at this point.

Notes:

I measure the delay of the call to (clEnqueueNDRangeKernel) from the host side (i.e. C++ time measurement functions).
I measure the execution times of the kernels by the OpenCL API (events + getProfilingInfo() function)

A similar thread is here ([url]https://devtalk.nvidia.com/default/topic/1001155/dispatch-kernel-overhead-opencl-/[/url]) but it is for one GPU and the cause was probably from NVIDIA Windows drivers.
Also, a similar one here [url]https://devtalk.nvidia.com/default/topic/466227/?comment=3312620[/url] for a single GPU and no one replied to the poor guy.

I’d appreciate if some one can help or answer the following questions:

Has any one faced a similar behavior ?
Can it be a driver issue ?
Any possible work-around ?

Regards,

Topic		Replies	Views
Dispatch Kernel Overhead (OpenCL) CUDA Programming and Performance	6	3620	March 28, 2017
OpenCL performance issues CUDA Programming and Performance	5	1925	January 25, 2010
kernel launch time way too long CUDA Programming and Performance	6	4026	July 5, 2011
Tesla C2050 - OpenCL - Kernel Concurrency Issue CUDA Programming and Performance	2	1188	June 3, 2014
launch kernels in parallel? CUDA Programming and Performance	16	23993	July 29, 2010
No speedup from run 2 kernels concurrently on a gpu device of compute capability 2.0 CUDA Programming and Performance	4	1533	October 10, 2011
OpenCL Asynchronous Kernel Launches CUDA Programming and Performance	6	2730	May 24, 2022
How expensive is calling clEnqueueNDRangeKernel? CUDA Programming and Performance	1	7872	May 12, 2010
Is task parallel programming (aka "concurrent kernels") in OpenCL supported? CUDA Programming and Performance	1	4287	August 31, 2011
OpenCL performance difference Linux/Windows CUDA Programming and Performance	0	1161	May 1, 2013

Increasing kernel dispatch overhead for multiple GPUs

Related topics