host-device latencies?

s.matovic · February 28, 2019, 2:24pm

I have been running some benchmarks recently and wonder whether my host-device latencies are
due to my older hardware, or whether they are similar on newer systems.

OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, 1 TFLOPS

OpenCL GPU kernel calls (each terminated with clFinish), 1 million threads, empty kernel, no memory buffer transfer:

~35K calls per second

OpenCL GPU kernel calls (each terminated with clFinish), 1 million threads, empty kernel, with an 8 KB memory write and a 4 KB memory read transfer per call:

~10K calls per second

Note that my machine is a bit outdated:

PCIe via Northbridge
PCIe 2.0
only 8 lanes per slot

Maybe on newer systems the latencies do not hurt at all?

Thanks in advance,
Srdja

njuffa · February 28, 2019, 6:38pm

I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has basically not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for minimal kernel execution time > 1 millisecond.
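To put these figures side by side, here is a small arithmetic sketch (not a benchmark). It converts between per-call latency and calls per second, and estimates what share of a launch-plus-kernel cycle is launch overhead. The function names are made up for illustration; the inputs are the 5 microsecond figure above and the 35K and 10K calls per second reported in the original post.

```python
# Back-of-the-envelope conversions; all helper names are hypothetical.

def calls_per_second(latency_us: float) -> float:
    """Upper bound on serialized calls per second for a given per-call latency."""
    return 1e6 / latency_us

def latency_us(calls_per_sec: float) -> float:
    """Average per-call latency implied by a measured call rate."""
    return 1e6 / calls_per_sec

def overhead_fraction(launch_us: float, kernel_us: float) -> float:
    """Share of one launch + execute cycle spent on launch overhead."""
    return launch_us / (launch_us + kernel_us)

if __name__ == "__main__":
    # 5 us minimum launch time -> at most 200,000 launches per second
    print(calls_per_second(5.0))
    # Rates reported in the original post, expressed as per-call latency
    print(round(latency_us(35_000), 1))   # empty kernel, no transfers
    print(round(latency_us(10_000), 1))   # empty kernel, 8 KB write + 4 KB read
    # A 1 ms kernel keeps a 5 us launch overhead below 1 percent
    print(round(overhead_fraction(5.0, 1000.0) * 100, 2))
```

Read this way, the reported rates correspond to roughly 29 and 100 microseconds per synchronized call, and a kernel that runs for 1 millisecond pays about half a percent in launch overhead.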

PCIe version and width impact primarily PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal.
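A rough estimate illustrates the throughput-versus-latency point for the small transfers in the original post. The sketch assumes the theoretical PCIe 2.0 rate of 500 MB/s per lane per direction (5 GT/s with 8b/10b encoding) and ignores packet and protocol overhead, so real wire times will be somewhat higher. The helper name is hypothetical.

```python
# Estimate pure wire time for small transfers on a PCIe 2.0 x8 link.
# Assumption: theoretical peak bandwidth, no protocol overhead.

PCIE2_BYTES_PER_SEC_PER_LANE = 500e6  # 5 GT/s with 8b/10b encoding

def wire_time_us(num_bytes: int, lanes: int,
                 per_lane: float = PCIE2_BYTES_PER_SEC_PER_LANE) -> float:
    """Time in microseconds to move num_bytes at the link's peak rate."""
    return num_bytes / (lanes * per_lane) * 1e6

if __name__ == "__main__":
    write_us = wire_time_us(8 * 1024, lanes=8)  # 8 KB host -> device
    read_us = wire_time_us(4 * 1024, lanes=8)   # 4 KB device -> host
    print(round(write_us + read_us, 2))         # total wire time in us

    # Extra time per call once transfers are added, from the reported rates
    extra_us = 1e6 / 10_000 - 1e6 / 35_000
    print(round(extra_us, 1))
```

Under these assumptions the 12 KB of traffic needs only about 3 microseconds on the wire, while the measured cost of adding the transfers is around 70 microseconds per call. A wider or faster link would shrink the small term, not the large one.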

s.matovic · March 1, 2019, 6:25am

Thanks, this is exactly what I was looking for.

I can change my design to device-based computation with about 1 second per run.

–
Srdja
