In order to get performance advice, it is recommended that you provide a minimal reproducible example so that other forum readers can compile, profile, and provide feedback. Without more information on the workload and a reproducible example it is difficult to provide help.
Nsight Compute does not show the constant cache data path in the Memory Statistics Diagram. IMC accesses are embedded in the SASS instruction in the form
c[bank][offset]
bank = 0 contains kernel parameters and per-grid launch values such as gridDim and blockDim.
bank = 3 contains user-defined __constant__ values.
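As a minimal illustration (kernel and variable names are my own, not from your code), kernel parameters and built-in launch values end up in bank 0, while user __constant__ data ends up in bank 3:

__constant__ float coeffs[16];                       // user data -> c[0x3][...] in SASS

__global__ void scale(float* out, float s, int n)    // out, s, n -> c[0x0][...]
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // gridDim/blockDim also come from bank 0
    if (i < n)
        out[i] = s * coeffs[i % 16];
}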
A value of 17 cycles for smsp__average_warps_issue_stalled_imc_miss_per_issue_active.ratio may indicate that the kernel is thrashing the constant cache, or it may simply mean that there is very little work per thread and this is the initial cache miss penalty.
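You can also collect that metric from the command line with the Nsight Compute CLI, for example (the application name is a placeholder):

ncu --metrics smsp__average_warps_issue_stalled_imc_miss_per_issue_active.ratio ./my_app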
It is possible to find the source lines (or instructions) with the highest stall counts in the Source View by looking at the Warp Stall Sampling column. If you do not see source code, you need to pass -lineinfo to nvcc when compiling the device code in order to generate the line table information.
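For example (file names are placeholders):

nvcc -O3 -lineinfo -o my_app my_app.cu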
I do indeed see the lines that cause this issue, and the corresponding instructions use bank 3.
What I don’t understand is why the cache hit rate is low, because I process the data sequentially. All threads loop over the same arrays of constant data. I use 6 arrays in SoA format: all x-values grouped, all y-values, all z-values. All values are fp16.
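For reference, here is a rough sketch of the kind of layout I mean (names, sizes, and the box test are made up for illustration, and it assumes a recent toolkit where __half can be placed in __constant__ memory):

#include <cuda_fp16.h>

#define NUM_BOXES 256    // made-up size

__constant__ __half box_min_x[NUM_BOXES], box_min_y[NUM_BOXES], box_min_z[NUM_BOXES];
__constant__ __half box_max_x[NUM_BOXES], box_max_y[NUM_BOXES], box_max_z[NUM_BOXES];

__global__ void count_hits(const float3* pts, int* hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 p = pts[i];
    int count = 0;
    for (int b = 0; b < NUM_BOXES; ++b) {            // every thread walks the same constant arrays
        // each iteration issues 6 separate 16-bit constant loads
        if (p.x >= __half2float(box_min_x[b]) && p.x <= __half2float(box_max_x[b]) &&
            p.y >= __half2float(box_min_y[b]) && p.y <= __half2float(box_max_y[b]) &&
            p.z >= __half2float(box_min_z[b]) && p.z <= __half2float(box_max_z[b]))
            ++count;
    }
    hits[i] = count;
}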
The ULDC instruction loads the data from the IMC cache into a warp-uniform register. The IMAD instructions are being used to move the value from the warp-uniform register to a thread register.
The ULDC instruction has to retire before the next instruction can retire. The IMC cache is per SM sub-partition (4 per SM). The IMC capacity is a few KiB and the cache line size is small. For each loop iteration you will either hit or miss in 6 cache lines; when you miss, it is likely that all 6 operations miss. Since all warps appear to be doing the same constant reads, it is likely that all warps stall at the same location.
The IMC cache has a maximum return width of 32 bits, so by reading fp16 instead of fp16x2 you are only getting half the bandwidth.
From the image I cannot tell the size of each array.
The best approach to improving the performance is to make sure you use the full return bandwidth of the cache. Given the very small number of registers it may be possible to unroll the loop 2x and use fp16x2; this may require padding numboxes. Without restructuring to read more than 16 bits per load I don’t think you will gain from using L1. L1 can return 128 B/cycle/SM; in the current code you would only be using 2 B/cycle/SM.
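A rough sketch of that idea (array names are placeholders, numboxes is assumed to be padded to an even count, and only the x test is shown for brevity): storing each coordinate stream as __half2 lets every constant load return a full 32 bits and processes two boxes per iteration.

#include <cuda_fp16.h>

#define NUM_BOXES 256                                 // placeholder, padded to a multiple of 2

__constant__ __half2 box_min_x2[NUM_BOXES / 2];       // element b packs boxes 2b and 2b+1
__constant__ __half2 box_max_x2[NUM_BOXES / 2];

__global__ void count_hits_x2(const float3* pts, int* hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float px = pts[i].x;
    int count = 0;
    for (int b = 0; b < NUM_BOXES / 2; ++b) {         // loop unrolled 2x over boxes
        float2 lo = __half22float2(box_min_x2[b]);    // one 32-bit constant load returns 2 values
        float2 hi = __half22float2(box_max_x2[b]);
        if (px >= lo.x && px <= hi.x) ++count;        // box 2b
        if (px >= lo.y && px <= hi.y) ++count;        // box 2b+1
    }
    hits[i] = count;
}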
I think I made a naive, invalid assumption that SoA is always better than AoS.
And although I think it is important to line up data as SoA when it is split up between threads, I don’t think I should have used SoA for data that is looped over within a thread.
The latter type of data is probably better served up as x/y/z/x/y/z/… so that it is all nearby in the cache, as opposed to going SoA.
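Something along these lines (names and size made up), rather than six separate streams:

#include <cuda_fp16.h>

struct Box {                        // interleaved x/y/z, so one box sits in adjacent constant bytes
    __half min_x, min_y, min_z;
    __half max_x, max_y, max_z;
};

__constant__ Box boxes[256];        // made-up size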
I guess I was trying to make it easier for the CUDA compiler to use vector instructions instead of scalar instructions.
That said, my OpenCL kernel has the exact same code and data layout, and is much faster.
The only real difference is that for the CUDA kernel I explicitly put some data in __constant__ memory, which I do not do for the OpenCL version. I’m not even sure whether you can explicitly assign data to the 64 KByte read-only cache when using OpenCL?