Significant speedup of OpenCL vs CUDA

I’m writing a GPU kernel. I cannot share the code yet because it is going to be published in an academic paper, but I can say that it basically operates on a dynamic programming table: there are some random accesses to a table stored in memory, plus some simple integer arithmetic and conditional statements that decide which values to store in other cells of the table.

I started with an OpenCL implementation that achieved good performance, then migrated to CUDA, expecting even higher performance.

To my surprise, the CUDA kernel is 7x slower. I am trying to understand why, but it is still obscure to me.
My first roadblock is that I cannot profile the OpenCL kernel; it seems that OpenCL profiling is not supported by Nsight.

Initially I thought the performance difference could be due to using the wrong CUDA compilation toolchain, but I am using the latest versions of the CUDA tools, so I have ruled that out.

Then I thought about poor memory usage in my CUDA version, but the Nsight profiling tool reports that my kernel is compute bound, and I am already using shared memory for maximum performance.

Nsight reports that the ALU pipeline utilization is 88%.

The question is: is it possible that the OpenCL compiler is doing a much better job than the CUDA one? To me, it seems unlikely.

The kernel code is almost identical. On the host side there is a minor difference, since in OpenCL kernel arguments are passed individually with the clSetKernelArg function. Because I call my kernel a lot (thousands of invocations per second), I avoid re-setting the arguments that are not changing (in OpenCL).
But I have ruled out that this is related to the performance degradation, because the profiler would otherwise complain about too low a workload…
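Roughly this pattern on the host side (a minimal sketch with made-up buffer/size names, not the real code):

// set once, after kernel creation: the arguments that never change
clSetKernelArg(kernel, 0, sizeof(cl_mem), &table_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &result_buf);

size_t global_size[1] = { work_items };   // made-up launch size
// per invocation: only re-set the argument that actually changes
for (cl_long iter = 0; iter < num_iters; ++iter) {
    clSetKernelArg(kernel, 2, sizeof(cl_long), &iter);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_size, NULL, 0, NULL, NULL);
}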

Does anyone have an idea of what could be happening?

I have tested this on an RTX 3090, an RTX 3080 and an RTX 2080 Ti, and I always get the same behaviour on those devices.

Thanks in advance.


Are you compiling a debug project, or using debug compilation switches? (Indicating your operating system and compilation command line may help.)

Working on Ubuntu Linux 20.04

Compiler version:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Compilation Flags:
SM=86
COMPUTE=$(SM)
CC=/usr/local/cuda/bin/nvcc
CFLAGS= -g -gencode arch=compute_$(COMPUTE),code=sm_$(SM) -O4

“Objection, your honor! Calls for speculation.”

You are correct to be suspicious. Typical real-life projects see performance gains when moving code from OpenCL to CUDA. Unfortunately, there is not much information to go on. I am afraid I do not know what “works on a dynamic programming table” means.

[Oops, missed on first reading; should have put my reading glasses on]: That -g on the nvcc commandline very much seems to indicate a debug build. From nvcc manual:

4.2.3.2. --debug (-g) Generate debug information for host code.

Given that, you can probably ignore these musings:

Brainstorming in random order, before the first coffee of the day:

(1) The CUDA and OpenCL versions do not actually perform the same calculations. There may be redundant computation performed in the CUDA code, or the computed results do not actually match completely, undiscovered due to holes in automated tests. Or the CUDA code is inadvertently configured to deliver higher-quality results requiring more iterations.

(2) Since OpenCL uses its own integer data types, while CUDA maintains compatibility of data types between host and device code, the data types used do not actually match between the versions (e.g. there are more instances of 64-bit integers used in CUDA code).

(3) OpenCL automatically sizes blocks and grids (I think), but CUDA requires programmers to do this manually. The CUDA code uses sub-optimal partitioning.

(4) The default compiler flags for floating-point arithmetic may differ between OpenCL and CUDA compilation. What happens if -use_fast_math is used for CUDA compilation?

(5) I have never used OpenCL. CUDA allows separate compilation, which often has a negative performance impact which can largely be compensated with link-time optimization. Do the OpenCL and CUDA builds match in the use of whole-program vs separate-compilation-modules programming style?

Normally this would be a red flag, as poor memory access patterns can have a large impact on performance, and the patterns of the OpenCL code might not match the patterns in the CUDA code exactly. But this caveat does not seem applicable when the profiler already reports the kernel as compute bound.

It wasn’t what I was looking for, however. I was looking for -G. I agree it’s not conclusive to look at a snippet of a build “makefile”. Better to see the actual compilation command.

I agree it is best (most conclusive) to see the actual compiler component invocations.

My logic here is that where there is a -g in CUDA compilation, a -G is typically right behind it. And I don’t think I have ever encountered a really large performance difference like this without an inadvertent comparison of debug with release build.

FWIW, I am not sure what the -O4 is supposed to do. Best I know, that is not something nvcc can make use of. Maybe -Xcompiler -O4 was intended?
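For illustration, a release-style build would simply drop the debug switches; device code is optimized by default as long as -G is absent, and host-compiler optimization can be passed through with -Xcompiler. A sketch with a placeholder file name:

nvcc -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 kernel.cu -o kernel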

I incorporated some of your suggestions; still the same behaviour.
Here is the actual invocation:

/usr/local/cuda/bin/nvcc -Xcompiler -O4 -g -gencode arch=compute_86,code=sm_86 -use_fast_math -I . -c app.cpp -o app.o

/usr/local/cuda/bin/nvcc -Xcompiler -O4 -g -gencode arch=compute_86,code=sm_86 -use_fast_math app.o -o app

The data provided so far does not make sense to me. I don’t have a mental model that would explain the observation. Maybe Robert Crovella has some additional ideas.

What happens if you remove -g?
What happens if you change -Xcompiler -O4 to -Xcompiler -O3 and -Xcompiler -O2?

Same behaviour.
I understand it’s hard to tell without looking at the code.
I will try to replicate the behaviour with an example I can share…
It will take me some days.

Well, at least that confirms that the performance of host code is highly unlikely to play any role in this issue, and that it all comes down to what is happening on the GPU.

That couldn’t possibly contain any CUDA device code.

Sorry, it’s app.cu… I avoided writing the original file names.

I think there is a decent chance you will discover the root cause for your observation as you work on isolating and reducing the relevant code into a minimal self-contained example.

The intriguing part for me is that the code (based on the data made available) is apparently bottlenecked on integer-intensive computation. Other than the issue of integer types that I raised (GPUs are 32-bit architectures and 64-bit integer operations are emulated, so gratuitous 64-bit integer computation should be avoided), I cannot think of anything in that area that would be affected by OpenCL / CUDA differences.

It looks like I’ve been able to replicate the behaviour in a much simpler kernel.
Find it below…

I started doing accumulation over a range of values, then selecting the max using a max2 macro, and finally using a max3 macro.

CUDA is typically faster than OpenCL… until I use the max3 macro; then it turns out to be much slower than OpenCL.

On an RTX 3080 the execution times I get are 2.5 (OpenCL) vs 7.4 (CUDA) seconds.

#include <cstdlib>
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>


// compile using 
// /usr/local/cuda/bin/nvcc -g -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 -use_fast_math test_cuda_kernel.cu -o test_cuda_kernel

#define max2(a,b) ((a)>(b))? (a):(b)
#define max3(a,b,c) max2((a), max2((b), (c)))

__global__  void test(long* pin, long* pout, long n)
{
   long gid = blockIdx.x;

   long sum = 0;
   for (long i = 0; i < n; i++)
   {
       long idx = gid - n;
       long idx2 = idx +1;
       if (idx > 0 && idx2 < gid)
           sum = max3(sum, pin[idx], pin[idx2]);
   }
   pout[gid] = sum;
}

struct timeval tnow;

double dtime(){
    gettimeofday(&tnow, NULL);
    return (double)tnow.tv_sec + (double)tnow.tv_usec * 1.0e-6;}

int main(int argc, char** argv)
{
    size_t size = 1024 * 1024 * 100;  // 100 Mi elements (800 MiB of 64-bit long)
    long* pin;
    long* pout;    
    cudaMalloc(&pin,  size * sizeof(long));
    cudaMalloc(&pout, size * sizeof(long));
    long n = 1000;    
    double t0 = dtime();
    test<<<size, 1>>>(pin, pout, n);    
    cudaDeviceSynchronize();            
    double tf = dtime();
    printf("n: %ld Ellapsed: %f\n", n, (tf-t0));
    
    return 0;
}

Surprisingly, using the following inlined function (fmax3) instead of the macro reduces the execution time to 4.9 (CUDA).

__device__ __forceinline__ long fmax3(long a, long b, long c)
{
    long r = a;
    if (b > r)
        r = b;
    if (c > r)
        r = c;
    return r;
}

These behaviours seem strange to me. Maybe I’m missing something. I don’t think the possible differences in long types (OpenCL vs CUDA) can be the reason, as they did not affect anything at all when using a single max2 macro. Does anyone have a reasonable explanation?

FWIW this:

test<<<size, 1>>>

is not the way to write performant CUDA code.
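A more conventional launch shape looks roughly like this (just a sketch, not a drop-in fix for your benchmark; the kernel would then need to derive gid from block and thread indices and guard against out-of-range ids):

const int threadsPerBlock = 256;   // arbitrary but typical block size
const size_t numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock;
test<<<numBlocks, threadsPerBlock>>>(pin, pout, n);

// and inside the kernel:
// long gid = (long)blockIdx.x * blockDim.x + threadIdx.x;
// if (gid >= size) return;   // size would have to be passed in as an extra parameter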

Regarding this:

According to my testing the max3 and fmax3 test cases are equivalent performance-wise if I define the max3 macro like this:

#define max3(a,b,c) max2((c), max2((b), (a)))

(I’m not suggesting that is a detailed, in-depth explanation, mostly just an observation.)
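For reference, this is what the preprocessor makes of the original argument order: the inner max2 is pasted in twice, so the second and third arguments each appear twice in the resulting expression (a hand expansion, shown modulo redundant parentheses, just to illustrate what the compiler has to untangle):

sum = max3(sum, pin[idx], pin[idx2]);
// expands to roughly:
sum = ((sum) > (((pin[idx]) > (pin[idx2])) ? (pin[idx]) : (pin[idx2])))
          ? (sum)
          : (((pin[idx]) > (pin[idx2])) ? (pin[idx]) : (pin[idx2]));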

If I were looking for insight into the OpenCL/CUDA perf difference, I would need to see the equivalent OpenCL test case also.

There aren’t any differences in the long datatype on linux between CUDA and OpenCL that I am aware of.

Friendly but urgent advice: Do not use long in your CUDA programs, ever.

CUDA maintains compatibility of types between host and device. This allows straightforward creation of __host__ __device__ code. On 64-bit Windows platforms, long is a 32-bit data type. On 64-bit Linux platforms, long is a 64-bit data type. Nasty surprises are almost guaranteed if CUDA code using long needs to be moved between Windows and Linux platforms. Use int and long long int (and their unsigned counterparts) as appropriate.
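A quick way to make those size assumptions explicit in host code (a sketch; the commented-out assertion would hold on 64-bit Linux under LP64 but fire on 64-bit Windows under LLP64):

#include <climits>

static_assert(sizeof(int) * CHAR_BIT == 32, "int is 32 bits on all CUDA platforms");
static_assert(sizeof(long long int) * CHAR_BIT == 64, "long long int is 64 bits on all CUDA platforms");
// static_assert(sizeof(long) * CHAR_BIT == 64, "long is 64 bits only under LP64, e.g. 64-bit Linux");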

Per the OpenCL 1.2 specification (6.1.1 Built-in Scalar Data Types), long is always a 64-bit type. OpenCL has its own type system that is independent of the host platform.

General note on performance: Because GPUs use 32-bit architectures (with minimal extensions to handle 64-bit addressing), one would not want to use 64-bit integers unless one has to. All 64-bit integer operations with the exception of conversions from and to floating-point types are emulated.

I do not have time to dive into the details of code generation right now. From a first glance, using an older version of CUDA (and thus the CUDA toolchain), the two variants, max3() macro and fmax3() inline function, generate quite different code and the code generated from the macro looks like … worthy of improvement. It would probably be worthwhile to perform a detailed comparison using the latest CUDA toolchain, and (depending on the result), pass this to the CUDA compiler team as an enhancement request.

Since max() is a built-in function in CUDA equivalent to std::max in regular C++, I would suggest coding:

#define max3(a,b,c) max(max(a,b),c)

If the pin[] are restricted in range (e.g. always positive), it might be possible to craft a custom version of max() that is faster than the built-in function. The following general-purpose implementation may also be worth trying:

__device__ long long int my_max (long long int a, long long int b)
{
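    // when a >= b the mask -(a >= b) is all ones, so this reduces to (a ^ b) ^ b == a; otherwise the mask is zero and it reduces to b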
    return ((a ^ b) & (- (a >= b))) ^ b;
}

[Later:]

Using a recent toolchain, I don’t see any inefficiency in the generated code when using CUDA’s built-in max() for max3(). I tried both int and long long int for pin, pout.


You are right, the difference between 4.9 and 7.4 is caused by the order of the comparisons.

About the block size, I’ve tried to remove grid issues from the comparison.
I assume OpenCL’s “workgroup size” is equivalent to CUDA’s “block size”.

Here you have the OpenCL code

// /usr/local/cuda/bin/nvcc -g -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 -use_fast_math test_opencl_kernel.cpp -l OpenCL -o test_opencl_kernel

#include <cstdlib>
#include <assert.h>
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

using namespace std;

const char* gCLSource = 
"#define max2(a,b) ((a)>(b))? (a):(b)\n"
"#define max3(a,b,c) max2((a), max2((b), (c)))\n"
"__kernel void test(\n"
"   __global long* pin,\n"
"   __global long* pout,\n"
"   long n)\n"
"{\n"
"   long gid = get_global_id(0);\n"
// "   printf(\"gid: %d\", gid);"
"\n"
"   long sum = 0;\n"
"   for (long i = 0; i < n; i++)\n"
"   {\n"
"       long idx = gid - n;\n"
"       long idx2 = idx +1;"
"       if (idx > 0 && idx2 < gid)\n"
"           sum = max3(sum, pin[idx], pin[idx2]);\n"
"   }\n"
"   pout[gid] = sum;\n"
"}\n";

struct timeval tnow;

double dtime()
{
    gettimeofday(&tnow, NULL);
    return (double)tnow.tv_sec + (double)tnow.tv_usec * 1.0e-6;
}

/*
 * 
 */
int main(int argc, char** argv)
{
    cl_int err;
    cl_uint pnum = 0;
    err = clGetPlatformIDs(0, 0, &pnum);
    assert(err == CL_SUCCESS);

    if (pnum <= 0){printf("No OpenCL platform\n"); exit(-1);}
    
    cl_platform_id pid;
    err = clGetPlatformIDs(1, &pid, 0);
    size_t plen = 0;
    err = clGetPlatformInfo(pid, CL_PLATFORM_NAME, 0, 0, &plen);
    assert(err == CL_SUCCESS);

    char pname[plen];
    err = clGetPlatformInfo(pid, CL_PLATFORM_NAME, plen, pname, 0 );
    assert(err == CL_SUCCESS);

    printf("OpenCL Platform Name: %s\n", pname);
    

    cl_uint dnum = 0;
    err = clGetDeviceIDs(pid, CL_DEVICE_TYPE_ALL, 0, 0, &dnum );
    assert(err == CL_SUCCESS);

    if (dnum <= 0){printf("No OpenCL device\n"); exit(-1);}
    
    cl_device_id did;

    err = clGetDeviceIDs(pid, CL_DEVICE_TYPE_ALL, 1, &did, 0 );
    assert(err == CL_SUCCESS);

    size_t dlen = 0;
    err = clGetDeviceInfo(did, CL_DEVICE_NAME, 0, 0, &dlen );
    assert(err == CL_SUCCESS);
    
    char dname[dlen];
    err = clGetDeviceInfo(did, CL_DEVICE_NAME, dlen, dname, 0 );
    assert(err == CL_SUCCESS);
    
    printf("OpenCL Device Name: %s\n", dname);
    
    cl_context_properties cprops[3];
    cprops[0] = CL_CONTEXT_PLATFORM;
    cprops[1] = cl_context_properties(pid);
    cprops[2] = 0;
    
    cl_context context = clCreateContext(&cprops[0], 1, &did, NULL, 0, &err);
    assert(err == CL_SUCCESS);

    
    size_t size = 1024 * 1024 * 100;  // 100 Mi elements (800 MiB of 64-bit long)
    
    cl_mem buf_in = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(long), NULL, &err);
    assert(err == CL_SUCCESS);
   
    cl_mem buf_out = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(long), NULL, &err);
    assert(err == CL_SUCCESS);

    cl_command_queue queue = clCreateCommandQueue(context, did, 0, &err);
    assert(err == CL_SUCCESS);

    const char* sources[] = {gCLSource};
    const size_t sourcesLen[] = {strlen(gCLSource)};
    
    printf("PROGRAM:\n%s\n", gCLSource);
    
    cl_program program = clCreateProgramWithSource(context, 1, sources, sourcesLen, &err);
    assert(err == CL_SUCCESS);

    const char* options = ""; // -cl-nv-verbose"; 

    cl_device_id dids[] = {did};
    
    err = clBuildProgram(program, 1, dids, options, NULL, NULL);
    assert(err == CL_SUCCESS);

    const char* name = "test";
    cl_kernel kernel = clCreateKernel(program, name, &err);
    assert(err == CL_SUCCESS);

    //err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, dst, 0, NULL, NULL);
    //assert(err == CL_SUCCESS);

    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&buf_in);
    assert(err == CL_SUCCESS);
    
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&buf_out);
    assert(err == CL_SUCCESS);

    long n = 1000;
    err = clSetKernelArg(kernel, 2, sizeof(cl_long), (void *)&n);
    assert(err == CL_SUCCESS);

    size_t wgSize[3] = {1, 1, 1};
    size_t gSize[3] = {size, 1, 1};

    double t0 = dtime();
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gSize, wgSize, 0, NULL, NULL);
    assert(err == CL_SUCCESS);

    err = clFinish(queue);
    assert(err == CL_SUCCESS);

    double tf = dtime();
    
    printf("n: %ld Ellapsed: %f\n", n, (tf-t0));
    
    
    return 0;
}

It would be best to avoid explicit comparisons and use the built-in function max() instead. From the OpenCL 1.2 spec (6.12.3 Integer Functions), this is also available in OpenCL:

gentype max (gentype x, gentype y)
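So in the kernel source string above, the macro could simply become (a one-line sketch of the change):

#define max3(a,b,c) max(max((a),(b)),(c))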

Sure, but the question here is: why does the same kernel code take more time in CUDA than in OpenCL?

Are you suggesting that the OpenCL compiler is able to infer the use of the built-in max function while the CUDA compiler is not?

Maybe. I do not know the internals of either compiler. Any number of compiler-internal code transformations could be an underlying cause, idiom recognition being just one of them. Knowing a little bit of the genesis of the two compilers (from my time working at NVIDIA from 2003 to 2014) I would not be surprised if there are numerous differences between the two compilers at this level of detail.

If you deem the code generation of the CUDA compiler sub-optimal in any particular instance, a good way to address this is to file an RFE (enhancement request) with NVIDIA through the bug reporting mechanism, pointing out the machine code that is currently being generated and indicating what should be generated instead.

My assumption here was that, orthogonal to any potential code-generation issues, there is also a practical interest in getting the highest-performing code for the research paper as soon as possible, since it seems you are gearing up for publication.

Does addressing the max3()-issue remove the entire observed 7x performance difference between OpenCL and CUDA, or are there additional issues?