Significant speedup of OpenCL vs CUDA

It wasn’t what I was looking for, however. I was looking for -G. I agree it’s not conclusive to look at a snippet of a build “makefile”. Better to see the actual compilation command.

I agree it is best (most conclusive) to see the actual compiler component invocations.

My logic here is that where there is a -g in CUDA compilation, a -G is typically right behind it. And I don’t think I have ever encountered a really large performance difference like this without an inadvertent comparison of debug with release build.

FWIW, I am not sure what the -O4 is supposed to do. Best I know, that is not something nvcc can make use of. Maybe -Xcompiler -O4 was intended?

I included some of your suggestions, still the same behaviour.
Here is the actual invocation

/usr/local/cuda/bin/nvcc -Xcompiler -O4 -g -gencode arch=compute_86,code=sm_86 -use_fast_math -I . -c app.cpp -o app.o

/usr/local/cuda/bin/nvcc -Xcompiler -O4 -g -gencode arch=compute_86,code=sm_86 -use_fast_math app.o -o app

The data provided so far does not make sense to me. I don’t have a mental model that would explain the observation. Maybe Robert Crovella has some additional ideas.

What happens if you remove -g?
What happens if you change -Xcompiler -O4 to -Xcompiler -O3 and -Xcompiler -O2?

Same behaviour.
I understand it’s hard to tell without looking at code.
I will try to replicate the behaviour with an example I can share…
It will take me some days.

Well, at least that confirms that the performance of host code is highly unlikely to play any role in this issue, and that it all comes down to what is happening on the GPU.

That couldn’t possibly contain any CUDA device code.

Sorry, it’s app.cu… I avoided writing the original file names.

I think there is a decent chance you will discover the root cause for your observation as you work on isolating and reducing the relevant code into a minimal self-contained example.

The intriguing part for me is that the code (based on the data made available) is apparently bottlenecked on integer-intensive computation. Other than the issue of integer types that I raised (GPUs are 32-bit architectures and 64-bit integer operations are emulated, so gratuitous 64-bit integer computation should be avoided) I cannot think of anything in that area that would be affected by OpenCL / CUDA differences.

It looks like I’ve been able to replicate the behaviour in a much simpler kernel.
Find it below…

I started doing accumulation over a range of values, then selecting the max using a max2 macro, and finally using a max3 macro.

CUDA is typically faster than OpenCL … until I use the max3 macro; then it turns out to be much slower than OpenCL.

On an RTX 3080 the execution times I get are 2.5 s (OpenCL) vs 7.4 s (CUDA).

#include <cstdlib>
#include <assert.h>
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>


// compile using 
// /usr/local/cuda/bin/nvcc -g -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 -use_fast_math test_cuda_kernel.cu -o test_cuda_kernel

#define max2(a,b) ((a)>(b))? (a):(b)
#define max3(a,b,c) max2((a), max2((b), (c)))

__global__  void test(long* pin, long* pout, long n)
{
   long gid = blockIdx.x;

   long sum = 0;
   for (long i = 0; i < n; i++)
   {
       long idx = gid - n;
       long idx2 = idx +1;
       if (idx > 0 && idx2 < gid)
           sum = max3(sum, pin[idx], pin[idx2]);
   }
   pout[gid] = sum;
}

struct timeval tnow;

double dtime(){
    gettimeofday(&tnow, NULL);
    return (double)tnow.tv_sec + (double)tnow.tv_usec * 1.0e-6;}

int main(int argc, char** argv)
{
    size_t size = 1024 * 1024 * 100;  // 100 Mi elements (800 MiB per buffer)
    long* pin;
    long* pout;    
    cudaMalloc(&pin,  size * sizeof(long));
    cudaMalloc(&pout, size * sizeof(long));
    long n = 1000;    
    double t0 = dtime();
    test<<<size, 1>>>(pin, pout, n);    
    cudaDeviceSynchronize();            
    double tf = dtime();
    printf("n: %ld Elapsed: %f\n", n, (tf-t0));
    
    return 0;
}

Surprisingly, using the following inlined function (fmax3) instead of the macro reduces the execution time to 4.9 s (CUDA).

__device__ __forceinline__ long fmax3(long a, long b, long c)
{
    long r = a;
    if (b > r)
        r = b;
    if (c > r)
        r = c;
    return r;
}

These behaviors seem strange to me. Maybe I’m missing something. I don’t think possible differences in the long type (OpenCL vs CUDA) can be the reason, as they had no effect at all when using a single max2 macro. Does anyone have a reasonable explanation?

FWIW this:

test<<<size, 1>>>

is not the way to write performant CUDA code.
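For context, a minimal sketch (editorial, not from the original posts) of the more conventional launch configuration: one thread per block leaves 31 of the 32 lanes in every warp idle, so the usual pattern is many threads per block with a global index derived from block and thread indices. The kernel name `test_blocked` and the extra `size` parameter are hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical rework: same kernel body as 'test', but launched with 256
// threads per block, so the global index must be derived from block and
// thread indices. The 'size' parameter guards the partially filled tail block.
__global__ void test_blocked(long* pin, long* pout, long n, long size)
{
    long gid = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= size) return;

    long sum = 0;
    for (long i = 0; i < n; i++)
    {
        long idx  = gid - n;
        long idx2 = idx + 1;
        if (idx > 0 && idx2 < gid)
        {
            if (pin[idx]  > sum) sum = pin[idx];
            if (pin[idx2] > sum) sum = pin[idx2];
        }
    }
    pout[gid] = sum;
}

// Launch with, e.g.:
//   int block = 256;
//   test_blocked<<<(size + block - 1) / block, block>>>(pin, pout, n, size);
```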

Regarding this:

According to my testing the max3 and fmax3 test cases are equivalent performance-wise if I define the max3 macro like this:

#define max3(a,b,c) max2((c), max2((b), (a)))

(I’m not suggesting that is a detailed, in-depth explanation, mostly just an observation.)

If I were looking for insight into the OpenCL/CUDA perf difference, I would need to see the equivalent OpenCL test case also.

There aren’t any differences in the long datatype on linux between CUDA and OpenCL that I am aware of.

Friendly but urgent advice: Do not use long in your CUDA programs, ever.

CUDA maintains compatibility of types between host and device. This allows straightforward creation of __host__ __device__ code. On 64-bit Windows platforms, long is a 32-bit data type. On 64-bit Linux platforms, long is a 64-bit data type. Nasty surprises are almost guaranteed if CUDA code using long needs to be moved between Windows and Linux platforms. Use int and long long int (and their unsigned counterparts) as appropriate.

Per the OpenCL 1.2 specification (6.1.1 Built-in Scalar Data Types), long is always a 64-bit type. OpenCL has its own type system that is independent of the host platform.

General note on performance: Because GPUs use 32-bit architectures (with minimal extensions to handle 64-bit addressing), one would not want to use 64-bit integers unless one has to. All 64-bit integer operations with the exception of conversions from and to floating-point types are emulated.

I do not have time to dive into the details of code generation right now. From a first glance, using an older version of CUDA (and thus the CUDA toolchain), the two variants, max3() macro and fmax3() inline function, generate quite different code and the code generated from the macro looks like … worthy of improvement. It would probably be worthwhile to perform a detailed comparison using the latest CUDA toolchain, and (depending on the result), pass this to the CUDA compiler team as an enhancement request.

Since max() is a built-in function in CUDA equivalent to std::max in regular C++, I would suggest coding:

#define max3(a,b,c) max(max(a,b),c)

If the pin[] are restricted in range (e.g. always positive), it might be possible to craft a custom version of max() that is faster than the built-in function. The following general-purpose implementation may also be worth trying:

__device__ long long int my_max (long long int a, long long int b)
{
    return ((a ^ b) & (- (a >= b))) ^ b;
}

[Later:]

Using a recent toolchain, I don’t see any inefficiency in the generated code when using CUDA’s built-in max() for max3(). I tried both int and long long int for pin, pout.


You are right, the difference between 4.9 and 7.4 is caused by the order of the comparisons.

About the block size, I’ve tried to remove grid issues from the comparison.
I assume OpenCL’s “workgroup size” is equivalent to CUDA’s “block size”.

Here you have the OpenCL code

// /usr/local/cuda/bin/nvcc -g -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 -use_fast_math test_opencl_kernel.cpp -l OpenCL -o test_opencl_kernel

#include <cstdlib>
#include <assert.h>
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

using namespace std;

const char* gCLSource = 
"#define max2(a,b) ((a)>(b))? (a):(b)\n"
"#define max3(a,b,c) max2((a), max2((b), (c)))\n"
"__kernel void test(\n"
"   __global long* pin,\n"
"   __global long* pout,\n"
"   long n)\n"
"{\n"
"   long gid = get_global_id(0);\n"
// "   printf(\"gid: %d\", gid);"
"\n"
"   long sum = 0;\n"
"   for (long i = 0; i < n; i++)\n"
"   {\n"
"       long idx = gid - n;\n"
"       long idx2 = idx +1;"
"       if (idx > 0 && idx2 < gid)\n"
"           sum = max3(sum, pin[idx], pin[idx2]);\n"
"   }\n"
"   pout[gid] = sum;\n"
"}\n";

struct timeval tnow;

double dtime()
{
    gettimeofday(&tnow, NULL);
    return (double)tnow.tv_sec + (double)tnow.tv_usec * 1.0e-6;
}

/*
 * 
 */
int main(int argc, char** argv)
{
    cl_int err;
    cl_uint pnum = 0;
    err = clGetPlatformIDs(0, 0, &pnum);
    assert(err == CL_SUCCESS);

    if (pnum <= 0){printf("No OpenCL platform\n"); exit(-1);}
    
    cl_platform_id pid;
    err = clGetPlatformIDs(1, &pid, 0);
    size_t plen = 0;
    err = clGetPlatformInfo(pid, CL_PLATFORM_NAME, 0, 0, &plen);
    assert(err == CL_SUCCESS);

    char pname[plen];
    err = clGetPlatformInfo(pid, CL_PLATFORM_NAME, plen, pname, 0 );
    assert(err == CL_SUCCESS);

    printf("OpenCL Platform Name: %s\n", pname);
    

    cl_uint dnum = 0;
    err = clGetDeviceIDs(pid, CL_DEVICE_TYPE_ALL, 0, 0, &dnum );
    assert(err == CL_SUCCESS);

    if (dnum <= 0){printf("No OpenCL device\n"); exit(-1);}
    
    cl_device_id did;

    err = clGetDeviceIDs(pid, CL_DEVICE_TYPE_ALL, 1, &did, 0 );
    assert(err == CL_SUCCESS);

    size_t dlen = 0;
    err = clGetDeviceInfo(did, CL_DEVICE_NAME, 0, 0, &dlen );
    assert(err == CL_SUCCESS);
    
    char dname[dlen];
    err = clGetDeviceInfo(did, CL_DEVICE_NAME, dlen, dname, 0 );
    assert(err == CL_SUCCESS);
    
    printf("OpenCL Device Name: %s\n", dname);
    
    cl_context_properties cprops[3];
    cprops[0] = CL_CONTEXT_PLATFORM;
    cprops[1] = cl_context_properties(pid);
    cprops[2] = 0;
    
    cl_context context = clCreateContext(&cprops[0], 1, &did, NULL, 0, &err);
    assert(err == CL_SUCCESS);

    
    size_t size = 1024 * 1024 * 100;  // 100 Mi elements (800 MiB per buffer)
    
    cl_mem buf_in = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(long), NULL, &err);
    assert(err == CL_SUCCESS);
   
    cl_mem buf_out = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(long), NULL, &err);
    assert(err == CL_SUCCESS);

    cl_command_queue queue = clCreateCommandQueue(context, did, NULL, &err);
    assert(err == CL_SUCCESS);

    const char* sources[] = {gCLSource};
    const size_t sourcesLen[] = {strlen(gCLSource)};
    
    printf("PROGRAM:\n%s\n", gCLSource);
    
    cl_program program = clCreateProgramWithSource(context, 1, sources, sourcesLen, &err);
    assert(err == CL_SUCCESS);

    const char* options = ""; // -cl-nv-verbose"; 

    cl_device_id dids[] = {did};
    
    err = clBuildProgram(program, 1, dids, options, NULL, NULL);
    assert(err == CL_SUCCESS);

    const char* name = "test";
    cl_kernel kernel = clCreateKernel(program, name, &err);
    assert(err == CL_SUCCESS);

    //err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, dst, 0, NULL, NULL);
    //assert(err == CL_SUCCESS);

    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&buf_in);
    assert(err == CL_SUCCESS);
    
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&buf_out);
    assert(err == CL_SUCCESS);

    long n = 1000;
    err = clSetKernelArg(kernel, 2, sizeof(cl_long), (void *)&n);
    assert(err == CL_SUCCESS);

    size_t wgSize[3] = {1, 1, 1};
    size_t gSize[3] = {size, 1, 1};

    double t0 = dtime();
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gSize, wgSize, 0, NULL, NULL);
    assert(err == CL_SUCCESS);

    err = clFinish(queue);
    assert(err == CL_SUCCESS);

    double tf = dtime();
    
    printf("n: %ld Elapsed: %f\n", n, (tf-t0));
    
    
    return 0;
}

It would be best to avoid explicit comparisons and use built-in function max() instead. From the OpenCL 1.2 spec (6.12.3 Integer Functions), this is also available in OpenCL:

gentype max (gentype x, gentype y)

Sure, but the question here is “why does the same kernel code take more time in CUDA than in OpenCL?”

Are you suggesting that the OpenCL compiler is able to infer the use of the built-in max function while the CUDA compiler is not?

Maybe. I do not know the internals of either compiler. Any number of compiler-internal code transformations could be an underlying cause, idiom recognition being just one of them. Knowing a little bit of the genesis of the two compilers (from my time working at NVIDIA from 2003 to 2014) I would not be surprised if there are numerous differences between the two compilers at this level of detail.

If you deem the code generation of the CUDA compiler sub-optimal in any particular instance, a good way to address this is to file an RFE (enhancement request) with NVIDIA through the bug reporting mechanism, pointing out the machine code that is currently being generated and indicating what should be generated instead.

My assumption here was that, orthogonal to any potential code generation issues, there is also a practical interest in getting the highest performing code for the research paper, as soon as possible, as it seemed like you are gearing up for publication.

Does addressing the max3()-issue remove the entire observed 7x performance difference between OpenCL and CUDA, or are there additional issues?

It doesn’t seem to, according to my testing. It certainly makes an improvement, however.

According to my testing, the performance deficit on the CUDA side is eliminated if we supply an appropriate pragma unroll directive, and modify the order of parameters of the fmax3 function:

$ cat t1966.cu
#include <cstdlib>
#include <assert.h>
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>


// compile using
// /usr/local/cuda/bin/nvcc -g -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 -use_fast_math test_cuda_kernel.cu -o test_cuda_kernel
#ifdef OPT
__device__ __forceinline__ long fmax3(long c, long b, long a)
#else
__device__ __forceinline__ long fmax3(long a, long b, long c)
#endif
{
    long r = a;
    if (b > r)
        r = b;
    if (c > r)
        r = c;
    return r;
}

#define max2(a,b) ((a)>(b))? (a):(b)
#define max3(a,b,c) max2((c), max2((b), (a)))

__global__  void test(long* pin, long* pout, long n)
{
   long gid = blockIdx.x;

   long sum = 0;
#ifdef OPT
#pragma unroll 10
#endif
   for (long i = 0; i < n; i++)
   {
       long idx = gid - n;
       long idx2 = idx +1;
       if (idx > 0 && idx2 < gid)
           sum = fmax3(sum, pin[idx], pin[idx2]);
   }
   pout[gid] = sum;
}

struct timeval tnow;

double dtime(){
    gettimeofday(&tnow, NULL);
    return (double)tnow.tv_sec + (double)tnow.tv_usec * 1.0e-6;}

int main(int argc, char** argv)
{
    size_t size = 1024 * 1024 * 100;  // 100 Mi elements (800 MiB per buffer)
    long* pin;
    long* pout;
    cudaMalloc(&pin,  size * sizeof(long));
    cudaMalloc(&pout, size * sizeof(long));
    long n = 1000;
    double t0 = dtime();
    test<<<size, 1>>>(pin, pout, n);
    cudaDeviceSynchronize();
    double tf = dtime();
    printf("n: %ld Ellapsed: %f\n", n, (tf-t0));

    return 0;
}
$ nvcc -arch=sm_70 -o t1966 t1966.cu
$ ./t1966
n: 1000 Ellapsed: 3.807186
$ nvcc -arch=sm_70 -o t1966 t1966.cu -DOPT
$ ./t1966
n: 1000 Ellapsed: 1.170802
$

(Tesla V100, CUDA 11.4, CentOS 7, 470.57.02)
(I don’t know if this has a similar effect on the cc8.6 or cc7.5 GPUs that OP is using.)

The selection of the pragma unroll value seems to be important. I tried values of 4, 5, 6, 8, 10, 12, 13, 14, 15, 20, and 100. 13 - 100 provided no significant benefit. 4 - 8 are almost as good as 10 and 12, which were the best (12 seems a few percent better than 10).

It might be worthwhile to file a bug for this. I’m not sure what could be done to make this outcome achievable without this additional coaxing, but the OpenCL example certainly makes it seem possible. There may be several things going on here.

Here’s my equivalent run of the provided OpenCL code on the same machine for reference:

$ cat t1967.cpp
// /usr/local/cuda/bin/nvcc -g -Xcompiler -O3 -gencode arch=compute_86,code=sm_86 -use_fast_math test_opencl_kernel.cpp -l OpenCL -o test_opencl_kernel

#include <cstdlib>
#include <assert.h>
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

using namespace std;

const char* gCLSource =
"#define max2(a,b) ((a)>(b))? (a):(b)\n"
"#define max3(a,b,c) max2((a), max2((b), (c)))\n"
"__kernel void test(\n"
"   __global long* pin,\n"
"   __global long* pout,\n"
"   long n)\n"
"{\n"
"   long gid = get_global_id(0);\n"
// "   printf(\"gid: %d\", gid);"
"\n"
"   long sum = 0;\n"
"   for (long i = 0; i < n; i++)\n"
"   {\n"
"       long idx = gid - n;\n"
"       long idx2 = idx +1;"
"       if (idx > 0 && idx2 < gid)\n"
"           sum = max3(sum, pin[idx], pin[idx2]);\n"
"   }\n"
"   pout[gid] = sum;\n"
"}\n";

struct timeval tnow;

double dtime()
{
    gettimeofday(&tnow, NULL);
    return (double)tnow.tv_sec + (double)tnow.tv_usec * 1.0e-6;
}

/*
 *
 */
int main(int argc, char** argv)
{
    cl_int err;
    cl_uint pnum = 0;
    err = clGetPlatformIDs(0, 0, &pnum);
    assert(err == CL_SUCCESS);

    if (pnum <= 0){printf("No OpenCL platform\n"); exit(-1);}

    cl_platform_id pid;
    err = clGetPlatformIDs(1, &pid, 0);
    size_t plen = 0;
    err = clGetPlatformInfo(pid, CL_PLATFORM_NAME, 0, 0, &plen);
    assert(err == CL_SUCCESS);

    char pname[plen];
    err = clGetPlatformInfo(pid, CL_PLATFORM_NAME, plen, pname, 0 );
    assert(err == CL_SUCCESS);

    printf("OpenCL Platform Name: %s\n", pname);


    cl_uint dnum = 0;
    err = clGetDeviceIDs(pid, CL_DEVICE_TYPE_ALL, 0, 0, &dnum );
    assert(err == CL_SUCCESS);

    if (dnum <= 0){printf("No OpenCL device\n"); exit(-1);}

    cl_device_id did;

    err = clGetDeviceIDs(pid, CL_DEVICE_TYPE_ALL, 1, &did, 0 );
    assert(err == CL_SUCCESS);

    size_t dlen = 0;
    err = clGetDeviceInfo(did, CL_DEVICE_NAME, 0, 0, &dlen );
    assert(err == CL_SUCCESS);

    char dname[dlen];
    err = clGetDeviceInfo(did, CL_DEVICE_NAME, dlen, dname, 0 );
    assert(err == CL_SUCCESS);

    printf("OpenCL Device Name: %s\n", dname);

    cl_context_properties cprops[3];
    cprops[0] = CL_CONTEXT_PLATFORM;
    cprops[1] = cl_context_properties(pid);
    cprops[2] = 0;

    cl_context context = clCreateContext(&cprops[0], 1, &did, NULL, 0, &err);
    assert(err == CL_SUCCESS);


    size_t size = 1024 * 1024 * 100;  // 100 Mi elements (800 MiB per buffer)

    cl_mem buf_in = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(long), NULL, &err);
    assert(err == CL_SUCCESS);

    cl_mem buf_out = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(long), NULL, &err);
    assert(err == CL_SUCCESS);
    cl_command_queue_properties prop = {0};
    cl_command_queue queue = clCreateCommandQueue(context, did, prop, &err);
    assert(err == CL_SUCCESS);

    const char* sources[] = {gCLSource};
    const size_t sourcesLen[] = {strlen(gCLSource)};

    printf("PROGRAM:\n%s\n", gCLSource);

    cl_program program = clCreateProgramWithSource(context, 1, sources, sourcesLen, &err);
    assert(err == CL_SUCCESS);

    const char* options = ""; // -cl-nv-verbose";

    cl_device_id dids[] = {did};

    err = clBuildProgram(program, 1, dids, options, NULL, NULL);
    assert(err == CL_SUCCESS);

    const char* name = "test";
    cl_kernel kernel = clCreateKernel(program, name, &err);
    assert(err == CL_SUCCESS);

    //err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, dst, 0, NULL, NULL);
    //assert(err == CL_SUCCESS);

    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&buf_in);
    assert(err == CL_SUCCESS);

    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&buf_out);
    assert(err == CL_SUCCESS);

    long n = 1000;
    err = clSetKernelArg(kernel, 2, sizeof(cl_long), (void *)&n);
    assert(err == CL_SUCCESS);

    size_t wgSize[3] = {1, 1, 1};
    size_t gSize[3] = {size, 1, 1};

    double t0 = dtime();
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gSize, wgSize, 0, NULL, NULL);
    assert(err == CL_SUCCESS);

    err = clFinish(queue);
    assert(err == CL_SUCCESS);

    double tf = dtime();

    printf("n: %ld Ellapsed: %f\n", n, (tf-t0));


    return 0;
}
$ nvcc t1967.cpp -o t1967 -lOpenCL
$ ./t1967
OpenCL Platform Name: NVIDIA CUDA
OpenCL Device Name: Tesla V100-PCIE-32GB
PROGRAM:
#define max2(a,b) ((a)>(b))? (a):(b)
#define max3(a,b,c) max2((a), max2((b), (c)))
__kernel void test(
   __global long* pin,
   __global long* pout,
   long n)
{
   long gid = get_global_id(0);

   long sum = 0;
   for (long i = 0; i < n; i++)
   {
       long idx = gid - n;
       long idx2 = idx +1;       if (idx > 0 && idx2 < gid)
           sum = max3(sum, pin[idx], pin[idx2]);
   }
   pout[gid] = sum;
}

n: 1000 Ellapsed: 1.672007
$

(I did make a change to the supplied OpenCL code to address a compiler warning, however it has no effect on the observation.)


I noticed what I considered unusually massive unrolling in the code generated by the CUDA compiler and was wondering whether that was helpful or harmful.

In the past the CUDA compiler contained multiple “unrollers” which sometimes were in conflict with each other, producing weird code-generation artifacts. I could imagine something like that playing a role here, but speculating about it without analyzing it in detail is not beneficial.

Given the not insignificant performance impact, having the compiler team root cause this would appear worthwhile. That requires filing a bug report as a starting point.

What about the block size? Could it be possible that the OpenCL compiler automatically ends up using a block size greater than 1, even though the specified “workgroup size” (in OpenCL terms) is 1?

I think the short answer is no.

Going beyond that, if you meant simply increasing the workgroup size without making any other changes, what purpose would that serve? There is no reason to think that alone makes code run faster. Also, it could easily break code, since it would increase the total number of threads (work-items, in OpenCL speak) being launched.

If you meant increasing the workgroup size and decreasing the grid size or whatever OpenCL calls it, a few points.

First of all, you don’t trust the language specifications? Where does it say that a compiler is allowed to do that?

Second, this could also break indexing calculations. Probably not a calculation based on get_global_id but that is not the only way to do indexing calculations in OpenCL. That is a convenience function.

It doesn’t seem like we need to speculate that OpenCL is doing something incredible. We already have a data point that shows that the CUDA compiler can be coaxed into a competitive position.
