I have been running some benchmarks recently and wonder whether my host-device latencies are
due to my older hardware or would be similar on newer systems.
OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, 1 TFLOPS
OpenCL GPU kernel calls (each followed by clFinish), 1 million threads, no memory buffer transfer, empty kernel:
~35K calls per second
OpenCL GPU kernel calls (each followed by clFinish), 1 million threads, with an 8 KB memory write and a 4 KB memory read transfer, empty kernel:
~10K calls per second
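To compare these rates with per-launch latency figures, they can be converted to an average time per call. A minimal sketch in Python; the function name per_call_microseconds is only illustrative, and the inputs are the rounded rates quoted above:

```python
def per_call_microseconds(calls_per_second):
    """Average wall-clock time per call, in microseconds."""
    return 1e6 / calls_per_second


# Rounded rates from the two measurements above.
empty_kernel = per_call_microseconds(35_000)    # no buffer transfers
with_transfers = per_call_microseconds(10_000)  # 8 KB write + 4 KB read

print(f"empty kernel:   {empty_kernel:.1f} us per call")
print(f"with transfers: {with_transfers:.1f} us per call")
print(f"difference:     {with_transfers - empty_kernel:.1f} us per call")
```

That works out to roughly 29 microseconds per call for the empty kernel and 100 microseconds with the transfers, so the two small transfers add about 71 microseconds per call.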
Note that my machine is a bit outdated:
- PCIe via Northbridge
- PCIe 2.0
- only 8 lanes per slot
Are these latencies perhaps negligible on newer systems?
Thanks in advance,
I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for a kernel execution time of more than 1 millisecond, so that the launch overhead is small by comparison.
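The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only an illustration: it takes the 5 microsecond figure above as the launch overhead and assumes launches are fully serialized with kernel execution (no overlap), which is a simplification. The function names are made up for this example.

```python
LAUNCH_OVERHEAD_S = 5e-6  # minimal launch time for a null kernel, as quoted above


def max_launches_per_second(overhead_s=LAUNCH_OVERHEAD_S):
    """Upper bound on launches per second if each launch costs overhead_s."""
    return 1.0 / overhead_s


def overhead_fraction(kernel_time_s, overhead_s=LAUNCH_OVERHEAD_S):
    """Share of total time spent on launch overhead.

    Assumption: launch and execution do not overlap.
    """
    return overhead_s / (overhead_s + kernel_time_s)


print(f"at most {max_launches_per_second():,.0f} launches per second")
for kernel_time_s in (10e-6, 100e-6, 1e-3, 1.0):
    share = overhead_fraction(kernel_time_s)
    print(f"kernel time {kernel_time_s * 1e6:>9.0f} us -> launch overhead {share:.3%}")
```

Under these assumptions, a 1 millisecond kernel spends about 0.5% of its total time on launch overhead, while a 10 microsecond kernel spends about a third of it.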
PCIe version and width primarily affect PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with a base frequency above 3.5 GHz.
Thanks, this is exactly what I was looking for.
I can change my design to device-based computation with about 1 second per run.