Why is CUDA slower than OpenCL?

Here is the code written in OpenCL:

__kernel void compute_gradients_8UC1_kernel(
    const int height, const int width,
    const int img_step, const int grad_quadstep, const int qangle_step,
    __global const uchar * img, __global float * grad, __global QANGLE_TYPE * qangle,
    const float angle_scale, const char correct_gamma, const int cnbins)
{
    const int x = get_global_id(0);
    const int tid = get_local_id(0);
    const int gSizeX = get_local_size(0);
    const int gidY = get_group_id(1);

    __global const uchar* row = img + gidY * img_step;

    //__local float sh_row[NTHREADS + 2];
    __local float sh_row[128 + 2];

    if (x < width)
        sh_row[tid + 1] = row[x];
    else
        sh_row[tid + 1] = row[width - 2];

    if (tid == 0)
        sh_row[0] = row[(int)max(x - 1, 1)];

    if (tid == gSizeX - 1)
        sh_row[gSizeX + 1] = row[(int)min(x + 1, width - 2)];

    barrier(CLK_LOCAL_MEM_FENCE);

    if (x < width)
    {
        float dx;

        if (correct_gamma == 1)
            dx = sqrt(sh_row[tid + 2]) - sqrt(sh_row[tid]);
        else
            dx = sh_row[tid + 2] - sh_row[tid];

        float dy = 0.f;
        if (gidY > 0 && gidY < height - 1)
        {
            float a = (float)img[(gidY + 1) * img_step + x];
            float b = (float)img[(gidY - 1) * img_step + x];
            if (correct_gamma == 1)
                dy = sqrt(a) - sqrt(b);
            else
                dy = a - b;
        }
        float mag = sqrt(dx * dx + dy * dy);

        float ang = (atan2(dy, dx) + CV_PI_F) * angle_scale - 0.5f;
        int hidx = (int)floor(ang);
        ang -= hidx;
        hidx = (hidx + cnbins) % cnbins;

        qangle[(gidY * qangle_step + x) << 1] = hidx;
        qangle[((gidY * qangle_step + x) << 1) + 1] = (hidx + 1) % cnbins;
        grad[(gidY * grad_quadstep + x) << 1] = mag * (1.f - ang);
        grad[((gidY * grad_quadstep + x) << 1) + 1] = mag * ang;
    }
}

And this is the CUDA version:

template <int nthreads, int correct_gamma>
__global__ void compute_gradients_8UC1_kernel(int height, int width, const PtrStepb img,
                                              float angle_scale, PtrStepf grad, PtrStepb qangle)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    int local_x = threadIdx.x;

    const unsigned char* row = (const unsigned char*)img.ptr(blockIdx.y);

    __shared__ float sh_row[128 + 2];

    if (x < width)
        sh_row[local_x + 1] = row[x];
    else
        sh_row[local_x + 1] = row[width - 2];

    if (local_x == 0)
        sh_row[0] = row[::max(x - 1, 1)];

    if (local_x == blockDim.x - 1)
        sh_row[blockDim.x + 1] = row[::min(x + 1, width - 2)];

    __syncthreads();

    if (x < width)
    {
        float dx;

        if (correct_gamma)
            dx = __fsqrt_rz(sh_row[local_x + 2]) - __fsqrt_rz(sh_row[local_x]);
        else
            dx = sh_row[local_x + 2] - sh_row[local_x];

        float dy = 0.f;
        if (blockIdx.y > 0 && blockIdx.y < height - 1)
        {
            float a = ((const unsigned char*)img.ptr(blockIdx.y + 1))[x];
            float b = ((const unsigned char*)img.ptr(blockIdx.y - 1))[x];
            if (correct_gamma)
                dy = __fsqrt_rz(a) - __fsqrt_rz(b);
            else
                dy = a - b;
        }
        float mag = __fsqrt_rz(dx * dx + dy * dy);

        float ang = (atan2f(dy, dx) + CV_PI_F) * angle_scale - 0.5f;
        int hidx = (int)floorf(ang);
        ang -= hidx;
        hidx = (hidx + cnbins) % cnbins;  // cnbins is a __constant__ set from the host in OpenCV

        ((uchar2*)qangle.ptr(blockIdx.y))[x] = make_uchar2(hidx, (hidx + 1) % cnbins);
        ((float2*)  grad.ptr(blockIdx.y))[x] = make_float2(mag * (1.f - ang), mag * ang);
    }
}

I used a 768 × 576 image to test them.

The OpenCL version took only 2.82 ms, but the CUDA version took about 8 ms. Why?
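For reference, the magnitude/angle binning that both kernels perform per pixel can be checked on the host with a small C sketch. It uses plain sqrtf/atan2f in place of the device functions, and assumes angle_scale = cnbins / (2π), which maps the shifted atan2 range (0, 2π] onto the bins:

#include <math.h>
#include <stdio.h>

/* Host-side sketch of the per-pixel binning both kernels perform.
   Each gradient vote is split between two adjacent orientation bins. */
static void bin_gradient(float dx, float dy, int cnbins, float angle_scale,
                         int *bin_lo, int *bin_hi, float *w_lo, float *w_hi)
{
    float mag = sqrtf(dx * dx + dy * dy);
    float ang = (atan2f(dy, dx) + (float)M_PI) * angle_scale - 0.5f;
    int hidx = (int)floorf(ang);
    ang -= hidx;                      /* fractional part: interpolation weight */
    hidx = (hidx + cnbins) % cnbins;  /* wrap into [0, cnbins) */

    *bin_lo = hidx;
    *bin_hi = (hidx + 1) % cnbins;
    *w_lo = mag * (1.f - ang);
    *w_hi = mag * ang;
}

int main(void)
{
    int lo, hi;
    float wlo, whi;
    const int cnbins = 9;             /* OpenCV's HOG default */
    const float angle_scale = cnbins / (2.0f * (float)M_PI);

    /* A 45-degree gradient of magnitude sqrt(2). */
    bin_gradient(1.0f, 1.0f, cnbins, angle_scale, &lo, &hi, &wlo, &whi);
    printf("bins %d/%d, weights %.3f/%.3f\n", lo, hi, wlo, whi);
    return 0;
}

The two weights always sum to the gradient magnitude, which is what the grad/qangle pair above encodes.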

Are you running a debug build of the CUDA code, maybe?

Both of them were tested in debug mode. These code segments are copied from the HOG object detection code in OpenCV. The difference between them when computing the histogram is huge: the CUDA version is about ten times slower than the OpenCL one. I think finding the reason will be hard work.

You need to time release builds. Code generated in debug mode (-g -O0) tends to be unoptimized and uses a lot more global memory access than necessary.

Generally speaking, performance comparisons using debug builds of an application are meaningless, on any platform.

BTW, the OpenCL and the CUDA code here do not seem to be functionally identical. Where the OpenCL code has sqrt the CUDA code has __fsqrt_rz. Why the use of __fsqrt_rz?

Thank you very much, njuffa. By compiling the .cu file with the option -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MD", simply choosing full optimization (/Ox), CUDA is faster than OpenCL now. As for why I use __fsqrt_rz: I assumed the compiler would map it to a machine instruction.

The intrinsics __fsqrt_r{n,z,u,d} are implementations of single-precision square root that provide results properly rounded according to one of the four rounding modes defined by the original IEEE 754 floating-point standard. The cost of all four variants is roughly the same.
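As an aside, the effect of a rounding mode on a correctly rounded square root can be seen on the host with C's fenv.h. This is a sketch using the CPU's sqrtf purely to illustrate directed rounding, not the GPU intrinsics themselves:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* sqrt(2) is inexact in binary floating point, so a correctly rounded
   square root differs in the last bit under different rounding modes,
   analogous to what __fsqrt_rn/_rz/_ru/_rd select on the GPU. */
int main(void)
{
    volatile float x = 2.0f;   /* volatile blocks compile-time folding */

    fesetround(FE_DOWNWARD);
    float down = sqrtf(x);     /* rounded toward -infinity */

    fesetround(FE_UPWARD);
    float up = sqrtf(x);       /* rounded toward +infinity */

    fesetround(FE_TONEAREST);  /* restore the default mode */

    printf("down = %.9g\nup   = %.9g\n", down, up);
    return 0;
}

On hardware whose sqrt honors the dynamic rounding mode, the two results differ by exactly one ulp.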

There is no machine instruction for square root in current NVIDIA GPUs. The closest you can get is to compile your code with -prec-sqrt=false, which causes single-precision square root sqrtf() to be approximate rather than correctly rounded. The approximate square root is based on the machine instructions MUFU.RCP and MUFU.RSQ, and possibly a few additional machine instructions, depending on whether the code is compiled with -ftz=true or -ftz=false.
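For intuition, an approximate square root can be built from a reciprocal-square-root estimate plus a refinement step, loosely mirroring the MUFU.RSQ-based approach njuffa describes. This host-side C sketch uses the classic bit-trick estimate purely as a stand-in for the GPU's hardware approximation:

#include <math.h>
#include <stdio.h>
#include <string.h>

/* Crude reciprocal-square-root estimate (the well-known bit trick),
   standing in for a hardware approximation like MUFU.RSQ. */
static float rsqrt_estimate(float x)
{
    unsigned i;
    float y;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y;
}

/* Approximate sqrt: one Newton-Raphson step refines the estimate of
   1/sqrt(x); multiplying by x then gives sqrt(x) = x * (1/sqrt(x)). */
static float sqrt_approx(float x)
{
    float y = rsqrt_estimate(x);
    y = y * (1.5f - 0.5f * x * y * y);
    return x * y;
}

int main(void)
{
    for (float x = 1.0f; x <= 4.0f; x += 1.0f)
        printf("sqrt_approx(%.0f) = %.6f (true %.6f)\n",
               x, sqrt_approx(x), sqrtf(x));
    return 0;
}

The result is accurate to a fraction of a percent, not correctly rounded; that accuracy/speed trade-off is exactly what -prec-sqrt=false opts into.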

A professional answer, thank you.