Significant speed gap between CUDA and OpenCL - how to debug?

hi everyone,

We have a new paper, published just a few days ago, on an OpenCL Monte Carlo photon simulator. In our tests, shown in the inset of Fig. 2 (attached below), we noticed a large speed gap between the OpenCL version of our code (http://github.com/fangq/mcxcl) and the CUDA version (http://github.com/fangq/mcx) on most of the NVIDIA GPUs tested. The CUDA-based simulation is about 2x to 5x faster than the OpenCL-based one, except on the GTX 1050Ti, where the two are about 1-to-1.

Compared with other papers on CUDA vs. OpenCL, the speed difference found in our study is quite high. We understand that NVIDIA’s CUDA driver is more up to date than its OpenCL driver; however, we feel that this alone is not enough to explain the observed difference.

The other curious data point is the GTX 1050Ti. This is the only GPU on which OpenCL reaches a speed comparable to CUDA. However, this is only the case after we enabled two control-flow-related optimizations (see the jump from the “x” to the “#” stacked bars for the 1050Ti):

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L526-L531
and
https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L278-L280

Enabling the above two code blocks makes the OpenCL version 1.8x faster than without them, and brings its speed up to a level comparable to CUDA.

Very often, a large speed improvement following a small change is a result of fragile compiler predicates. We have reported similarly drastic speed changes with older CUDA drivers. However, for OpenCL on NVIDIA GPUs, we have very limited ways to tell what is happening. There is no OpenCL profiler like nvvp, and -cl-nv-verbose does not tell much either.

I am curious what you think about this. Is there any tool or technique we can use to find out why the OpenCL version is slower?

The following is quite speculative, not any kind of conclusive analysis.

The code may be limited by memory throughput on the GTX 1050 Ti, while it is (partially) limited by computation throughput on GPUs with higher memory bandwidth. Do you have a roofline performance model for this application?

I don’t have time to dig into your code in detail, but based on a cursory glance it seems it involves some amount of transcendental functions, in particular trigonometry. So the heavy computational load may be (partially) related to those and may also contribute to different speedups between CUDA and OpenCL. NVIDIA pretty much froze OpenCL four or five years ago, while CUDA is being optimized continuously.

However, your specific optimizations seem to be related to thread divergence surrounding the use of the curious mcx_nextafter() function? This may be partially connected to the use of more advanced compiler technology in the CUDA toolchain, which may lead to better handling of possibly divergent branches in CUDA vs OpenCL. Only a detailed analysis of the generated machine code would be able to confirm or refute that hand-wavy assumption. Can you get at the SASS in the OpenCL environment? Profiler stats would also help in understanding the effects of these local code changes but, alas, they are not available for OpenCL.

I spotted some code idioms that may not represent best practice. When I see ‘t = x*M_PI; r = sincos (t)’, it suggests sincospi() should be used instead of sincos() for accuracy and performance. Also there are expressions like ‘sin(acos(x))’ that suggest they might be replaceable by algebraic computation, unless the intermediate angle is used elsewhere. E.g. closely related cases:

sin(atan2(y,x)) = y * rhypot(x,y)
cos(atan2(y,x)) = x * rhypot(x,y)

thanks njuffa, your comments have always been helpful.

We know this particular kernel is compute-bound. From the profiler output of the CUDA implementation, memory latency only accounts for 3-4% of the total latency (as opposed to 41% due to execution dependency and 23% due to instruction fetch). We also run the kernel with a large number of threads, so the global memory latency is effectively hidden.

For both the OpenCL and CUDA versions, this kernel is largely limited by register usage. The OpenCL version uses 48-58 registers (depending on JIT options and platforms) and the CUDA version uses 64+ registers.

Regarding your question on “roofline performance”, to be honest, we are not sure for either the CUDA or the OpenCL version. I did run "nvprof --metrics flops_sp … " and estimated that the code is currently running at around 1 TFLOPS on a TITAN V GPU (and ~1.2 TFLOPS on the 1080Ti using CUDA 7.5). We know that the TITAN V has a theoretical fp32 peak of 13.8 TFLOPS, so our CUDA code is now at about 7%-10% of the theoretical max throughput (for the OpenCL version, divide that by 2 to 3). I don’t really know how realistic application throughput compares to the max FLOPS numbers, but for such a complex kernel, it was a pleasant surprise.

The trigonometry functions are a big part of the simulation, and account for about 15-30% of the total compute load (based on VTune profiling of the OpenCL code on the CPU; for NVIDIA GPUs, nvvp did not show large overhead at those lines for the CUDA code). I agree there is room to cut some small corners, but I don’t think it is the reason for the difference between OpenCL and CUDA.

I wish there were a line-by-line profiling tool for OpenCL on NVIDIA GPUs, like nvvp for CUDA; I would probably find a lot of useful hints about what is happening.

Comparing the machine code instruction counts for the 1050Ti before and after applying the control-flow optimizations could be one way to proceed; I will ask my colleague to delve into it. But I guess it won’t tell us why the counts are different (if they are indeed found to differ).

Given the nature of the GPU architecture, handling branches intelligently (especially in deeply nested code) is one of the more important optimizations performed by the CUDA compiler. And often, optimizing away branches opens up new opportunities for additional optimizations (e.g. CSE or instruction scheduling).

I think it is entirely possible that more advanced branch optimizations in modern CUDA vs the older OpenCL infrastructure might explain why your branch-removing manual optimizations aren’t necessary with CUDA. Only in-depth analysis of the generated SASS (machine code) could confirm that, and it would involve more than just counting instructions, which doesn’t account for the effects of divergence. Such analysis is time-consuming in my experience, because it requires back-annotating the machine code to see what is happening, at a speed of maybe 50 instructions per hour (the high degree of optimization applied by the tool chain usually transforms code heavily, making it difficult to follow at SASS level).

As for the transcendental functions in CUDA, significant performance improvements have been made in the four to five years since OpenCL was more or less frozen. But I agree any influence at app level would be limited, maybe on the order of 1.2x here, not 2x-5x as shown in your graph.

It is interesting that you mention higher register use by the CUDA code. The CUDA compiler may choose to use registers more generously to boost performance, for example by storing to temporary variables instead of recomputing certain expressions multiple times.