hi everyone,
we have a new paper, published just a few days ago, on an OpenCL Monte Carlo photon simulator. From our tests, shown in the inset of Fig. 2 (attached below), we noticed a large speed gap between the OpenCL version of our code (http://github.com/fangq/mcxcl) and the CUDA version (http://github.com/fangq/mcx) on most of the NVIDIA GPUs we tested. The CUDA-based simulation is about 2x to 5x faster than the OpenCL-based one, except on the GTX 1050Ti, where the two are roughly 1-to-1.
Compared with other papers that benchmark CUDA against OpenCL, the speed difference we found is quite large. We understand that NVIDIA keeps its CUDA toolchain more up-to-date than its OpenCL support; however, we feel that alone is not enough to explain the gap we observed.
the other curious data point is the GTX 1050Ti. It is the only GPU on which OpenCL reaches a speed comparable to CUDA. However, this only happens after we enable two control-flow-related optimizations (see the jump from the "x" to the "#" stacked bars for the 1050Ti):
https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L526-L531
and
https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L278-L280
enabling the above two code blocks makes the OpenCL version 1.8x faster than without them, and brings its speed close to the CUDA version.
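for readers who do not want to open the links, below is a hypothetical sketch of the general kind of control-flow change we mean. The kernel, macro, and variable names are made up; this is NOT the actual mcx_core.cl code, just an illustration of gating a per-step branch behind a build-time option:

/* Hypothetical illustration only -- NOT the actual mcx_core.cl code.
   The idea: a conditional inside the hot photon-propagation loop is
   compiled out by a build-time macro (e.g. clBuildProgram option
   "-D DO_BOUNDARY") instead of being tested at run time every step. */
__kernel void propagate(__global float *weight, const float n1, const float n2)
{
    int gid = get_global_id(0);
    float w = weight[gid];

    for (int step = 0; step < 1000; step++) {
        w *= 0.999f;                 /* stand-in for per-step absorption */
#ifdef DO_BOUNDARY                   /* hypothetical compile-time switch */
        if (n1 != n2)                /* boundary handling only when enabled */
            w *= 0.95f;              /* stand-in for a reflectance factor */
#endif
    }
    weight[gid] = w;
}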
very often, a large speed improvement following a small code change is a sign of fragile compiler optimizations around control flow. We have reported similarly drastic speed changes with older CUDA drivers. However, for OpenCL on NVIDIA GPUs we have very limited ways to tell what is happening: there is no OpenCL profiler comparable to nvvp, and -cl-nv-verbose does not tell us much either.
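for what it is worth, the only inspection we know of on the OpenCL side is to read the build log (where the -cl-nv-verbose register report ends up) and to dump the program binary, which on NVIDIA's OpenCL is PTX text. A minimal sketch, assuming a single-device program "prog" built for device "dev" (error handling omitted):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* dump the NVIDIA OpenCL build log and the PTX of an already-built program */
void dump_build_output(cl_program prog, cl_device_id dev)
{
    size_t len = 0;

    /* build log -- this is where the -cl-nv-verbose register report shows up */
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = (char *)malloc(len + 1);
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    log[len] = '\0';
    fprintf(stderr, "%s\n", log);
    free(log);

    /* program binary -- on NVIDIA's OpenCL this is PTX text */
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(len), &len, NULL);
    unsigned char *ptx = (unsigned char *)malloc(len);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptx), &ptx, NULL);
    FILE *fp = fopen("kernel.ptx", "wb");
    fwrite(ptx, 1, len, fp);
    fclose(fp);
    free(ptx);
}

the resulting kernel.ptx can be compared side-by-side with the PTX from nvcc --ptx for the CUDA version, but this still falls well short of a real profiler.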
I am curious what you think about this. Is there any tool or technique we can use to find out why the OpenCL version is slower?