Same kernel 3x slower on CUDA than on OpenCL

I have been porting a CUDA kernel to OpenCL, and on the same GPU (an RTX 3070) the OpenCL version now runs about three times faster than the CUDA version.

The code is the same and uses half-precision floats: __half on CUDA and half on OpenCL.

I am investigating why the CUDA kernel is so slow, using both ncu and ncu-ui tools.

What I’ve learned so far:

Compute Speed-Of-Light (SoL) is 50% and memory SoL is 3%.

I have 50% eligible warps, and active warps are close to the theoretical maximum of 12.

Issued warps/scheduler is 0.5

The profiler warns me that 17 cycles are wasted on Immediate Constant Cache (IMC) misses.

This leaves me wondering: does that refer to immediate values in my code, or does it also include data marked with __constant__?

Where is the 64KB of constant memory shown in this image?

NOTE: I have a lot of work per work-item, so I expect the memory throughput to be low.

NOTE: Ironically, the much faster OpenCL implementation runs on top of CUDA. The platform reports: FULL_PROFILE, OpenCL 3.0 CUDA 12.1.68, NVIDIA CUDA, NVIDIA.

PS: Is Nsight able to profile OpenCL kernels? When I try, I get a message that no kernels were profiled; so far it has only worked with CUDA for me.

Thanks for reaching out. Give us a little time to investigate this and see if we can come to any conclusions.

With respect to OpenCL, Nsight Compute does not support profiling OpenCL kernels.

In order to get performance advice, it is recommended that you provide a minimal reproducible example so that other forum readers can compile, profile, and provide feedback. Without more information on the workload and a reproducible example it is difficult to provide help.

Nsight Compute does not show the constant cache data path in the Memory Statistics Diagram. IMC accesses are embedded in the SASS instruction in the form

c[bank][offset]

bank = 0 contains kernel parameters and per-grid launch values such as GridDim and BlockDim.
bank = 3 contains user-defined constant values (see the sketch below).
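As a minimal sketch (the kernel and array names are placeholders, not taken from your code), this is roughly how the two banks show up:

```cpp
#include <cuda_fp16.h>

// Placeholder example: a user-defined __constant__ array and ordinary
// kernel parameters, illustrating the bank mapping described above.
__constant__ __half coeffs[64];                      // referenced as c[3][...] in SASS

__global__ void scale(const __half* in, __half* out, int n)   // parameters live in c[0][...]
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // gridDim/blockDim also come from bank 0
    if (i < n)
        out[i] = __float2half(__half2float(in[i]) * __half2float(coeffs[i % 64]));
}
```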

A value of 17 cycles for smsp__average_warps_issue_stalled_imc_miss_per_issue_active.ratio may indicate that the kernel is thrashing the constant cache or may mean you have very little work per thread and this is the initial cache miss penalty.

It is possible to find the source lines (or instructions) with the highest stall reasons in the Source View by looking at the Warp Stall Sampling column. If you do not see source code then you need to pass -lineinfo to NVCC when compiling device code in order to generate the line table information.
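For example (file and binary names are placeholders):

```
nvcc -O3 -lineinfo -o mykernel mykernel.cu   # -lineinfo generates the line table used by the Source View
ncu -o report ./mykernel                     # writes report.ncu-rep, which can be opened in ncu-ui
```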


Thank you Greg,

It’s good to know about those banks.

I do indeed see the lines that cause this issue, and the corresponding instructions use bank 3.

What I don’t understand is why the cache hit rate is low, because I process the data sequentially: all threads loop over the same arrays of constant data. I use 6 arrays in SoA format, with all x-values grouped, then all y-values, then all z-values. All values are fp16.
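Roughly, the layout looks like this (simplified; the names and the array size are placeholders):

```cpp
#include <cuda_fp16.h>

#define NUMBOXES 256   // placeholder size; the real count comes from my data

// Six SoA arrays of fp16 constants; every thread loops over all of them.
__constant__ __half box_x[NUMBOXES], box_y[NUMBOXES], box_z[NUMBOXES];
__constant__ __half dir_x[NUMBOXES], dir_y[NUMBOXES], dir_z[NUMBOXES];

__global__ void traverse(float* out)
{
    float acc = 0.0f;
    for (int i = 0; i < NUMBOXES; ++i) {
        // six scalar fp16 constant loads per iteration
        acc += __half2float(box_x[i]) + __half2float(box_y[i]) + __half2float(box_z[i])
             + __half2float(dir_x[i]) + __half2float(dir_y[i]) + __half2float(dir_z[i]);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```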

The ULDC instruction loads the data from the IMC cache into a warp-uniform register. The IMAD instructions are being used to move the value from the warp-uniform register to a thread register.

The ULDC instruction has to retire before the next instruction can retire. The IMC cache is per SM sub-partition (4 per SM). The IMC capacity is a few KiB and the cache line size is small. For each loop iteration you will either hit or miss 6 cache lines, and when you miss it is likely that all 6 operations will miss. It appears that all warps are doing the same constant reads, so it is likely that all warps will stall at the same location.

The IMC cache has a maximum return of 32 bits, so by reading fp16 instead of fp16x2 you are only getting half the bandwidth.

From the image I cannot tell the size of each array.

The best approach to improving the performance is to make sure you use the full return bandwidth of the cache. Given the very small number of registers, it may be possible to unroll the loop 2x and use fp16x2; this may require padding numboxes. Without restructuring to read more than 16 bits per load I don’t think you will gain from using L1: L1 can return 128 B/cycle/SM, while the current code would only be using 2 B/cycle/SM.
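A rough sketch of that restructuring, assuming the arrays can be stored as fp16x2 pairs and numboxes is padded to an even count (names are placeholders):

```cpp
#include <cuda_fp16.h>

#define NUMBOXES2 128   // placeholder: numboxes padded to an even count, divided by 2

// Each element now packs two consecutive x-values into one 32-bit constant load.
__constant__ __half2 box_x2[NUMBOXES2];
// ... same for box_y2, box_z2, dir_x2, dir_y2, dir_z2

__global__ void traverse2(float* out)
{
    float2 acc = make_float2(0.0f, 0.0f);
    for (int i = 0; i < NUMBOXES2; ++i) {
        // one ULDC now returns two fp16 values, using the full 32-bit return path
        float2 x = __half22float2(box_x2[i]);
        acc.x += x.x;
        acc.y += x.y;
        // ... same for the other five arrays
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc.x + acc.y;
}
```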


Thank you, Greg.
Again, very helpful information.

I think I made a naive, invalid assumption that SoA is always better than AoS.

And although I think it is important to line data up as SoA when it is split up between threads, I don’t think I should have used SoA for data that is looped over within a thread.

The latter kind of data is probably better served up as x/y/z/x/y/z/… so that it is all nearby in the cache, as opposed to going SoA.
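Something like this interleaved layout, for example (placeholder names):

```cpp
#include <cuda_fp16.h>

// Interleaved (AoS) layout: each element's x/y/z sit next to each other,
// so looping over elements walks consecutive bytes in the cache.
struct BoxHalf { __half x, y, z; };
__constant__ BoxHalf boxes[256];   // placeholder size
```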

I guess I was trying to make it easier for the CUDA compiler to use vector instructions instead of scalar instructions.

That said, my OpenCL kernel has the exact same code and data layout, and is much faster.
The only real difference is that for the CUDA kernel I explicitly put some data in __constant__ memory, which I do not do for the OpenCL version. I’m not even sure whether you can explicitly place data in the 64 KB of constant memory when using OpenCL?

Thanks again,

Bram

I was able to claw back all that performance by redoing my memory accesses.

Before, I was loading some global memory conditionally, using a branch.

Now I load for both branches unconditionally, and later select one or the other with the ?: operator.
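Roughly, the change looked like this (simplified, with placeholder names):

```cpp
#include <cuda_fp16.h>

// Before (simplified): the global load sits behind a branch.
__device__ __half load_before(bool use_near, const __half* near_data,
                              const __half* far_data, int idx)
{
    __half v;
    if (use_near) v = near_data[idx];
    else          v = far_data[idx];
    return v;
}

// After: both loads are issued unconditionally and the result is selected,
// so the loads can be in flight at the same time instead of waiting on the branch.
__device__ __half load_after(bool use_near, const __half* near_data,
                             const __half* far_data, int idx)
{
    __half a = near_data[idx];
    __half b = far_data[idx];
    return use_near ? a : b;
}
```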

My eligible warps went up by roughly 400% after I did that.

Now my CUDA kernel runs as fast as the OpenCL kernel.
