Why is CUDA slower than OpenCL?

Here is the code written in OpenCL:

__kernel void compute_gradients_8UC1_kernel(
    const int height, const int width,
    const int img_step, const int grad_quadstep, const int qangle_step,
    __global const uchar * img, __global float * grad, __global QANGLE_TYPE * qangle,
    const float angle_scale, const char correct_gamma, const int cnbins)
{
    const int x = get_global_id(0);
    const int tid = get_local_id(0);
    const int gSizeX = get_local_size(0);
    const int gidY = get_group_id(1);

    __global const uchar* row = img + gidY * img_step;

    //__local float sh_row[NTHREADS + 2];
    __local float sh_row[128 + 2];

    if (x < width)
        sh_row[tid + 1] = row[x];
    else
        sh_row[tid + 1] = row[width - 2];

    if (tid == 0)
        sh_row[0] = row[(int)max(x - 1, 1)];

    if (tid == gSizeX - 1)
        sh_row[gSizeX + 1] = row[(int)min(x + 1, width - 2)];

    barrier(CLK_LOCAL_MEM_FENCE);

    if (x < width)
    {
        float dx;

        if (correct_gamma == 1)
            dx = sqrt(sh_row[tid + 2]) - sqrt(sh_row[tid]);
        else
            dx = sh_row[tid + 2] - sh_row[tid];

        float dy = 0.f;
        if (gidY > 0 && gidY < height - 1)
        {
            float a = (float)img[(gidY + 1) * img_step + x];
            float b = (float)img[(gidY - 1) * img_step + x];
            if (correct_gamma == 1)
                dy = sqrt(a) - sqrt(b);
            else
                dy = a - b;
        }
        float mag = sqrt(dx * dx + dy * dy);

        float ang = (atan2(dy, dx) + CV_PI_F) * angle_scale - 0.5f;
        int hidx = (int)floor(ang);
        ang -= hidx;
        hidx = (hidx + cnbins) % cnbins;

        qangle[(gidY * qangle_step + x) << 1] = hidx;
        qangle[((gidY * qangle_step + x) << 1) + 1] = (hidx + 1) % cnbins;
        grad[(gidY * grad_quadstep + x) << 1] = mag * (1.f - ang);
        grad[((gidY * grad_quadstep + x) << 1) + 1] = mag * ang;
    }
}

And this is the CUDA version:

template <int nthreads, int correct_gamma>
__global__ void compute_gradients_8UC1_kernel(int height, int width, const PtrStepb img,
                                              float angle_scale, PtrStepf grad, PtrStepb qangle)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    int local_x = threadIdx.x;

    const unsigned char* row = (const unsigned char*)img.ptr(blockIdx.y);

    __shared__ float sh_row[128 + 2];

    if (x < width)
        sh_row[local_x + 1] = row[x];
    else
        sh_row[local_x + 1] = row[width - 2];

    if (local_x == 0)
        sh_row[0] = row[::max(x - 1, 1)];

    if (local_x == blockDim.x - 1)
        sh_row[blockDim.x + 1] = row[::min(x + 1, width - 2)];

    __syncthreads();

    if (x < width)
    {
        float dx;

        if (correct_gamma)
            dx = __fsqrt_rz(sh_row[local_x + 2]) - __fsqrt_rz(sh_row[local_x]);
        else
            dx = sh_row[local_x + 2] - sh_row[local_x];

        float dy = 0.f;
        if (blockIdx.y > 0 && blockIdx.y < height - 1)
        {
            float a = ((const unsigned char*)img.ptr(blockIdx.y + 1))[x];
            float b = ((const unsigned char*)img.ptr(blockIdx.y - 1))[x];
            if (correct_gamma)
                dy = __fsqrt_rz(a) - __fsqrt_rz(b);
            else
                dy = a - b;
        }
        float mag = __fsqrt_rz(dx * dx + dy * dy);

        float ang = (atan2f(dy, dx) + CV_PI_F) * angle_scale - 0.5f;
        int hidx = (int)floorf(ang);
        ang -= hidx;
        hidx = (hidx + cnbins) % cnbins;  // cnbins is a __constant__ set from the host in OpenCV

        ((uchar2*)qangle.ptr(blockIdx.y))[x] = make_uchar2(hidx, (hidx + 1) % cnbins);
        ((float2*)  grad.ptr(blockIdx.y))[x] = make_float2(mag * (1.f - ang), mag * ang);
    }
}

I used a 768 × 576 image to test them.

The OpenCL version took only 2.82 ms, but the CUDA version took about 8 ms. Why?
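For reference, the magnitude/angle binning that both kernels perform per pixel can be checked on the host with a small C sketch. It uses plain sqrtf/atan2f in place of the device functions, and assumes angle_scale = cnbins / (2π), which maps the shifted atan2 range (0, 2π] onto the bins:

#include <math.h>
#include <stdio.h>

/* Host-side sketch of the per-pixel binning both kernels perform.
   Each gradient vote is split between two adjacent orientation bins. */
static void bin_gradient(float dx, float dy, int cnbins, float angle_scale,
                         int *bin_lo, int *bin_hi, float *w_lo, float *w_hi)
{
    float mag = sqrtf(dx * dx + dy * dy);
    float ang = (atan2f(dy, dx) + (float)M_PI) * angle_scale - 0.5f;
    int hidx = (int)floorf(ang);
    ang -= hidx;                      /* fractional part: interpolation weight */
    hidx = (hidx + cnbins) % cnbins;  /* wrap into [0, cnbins) */

    *bin_lo = hidx;
    *bin_hi = (hidx + 1) % cnbins;
    *w_lo = mag * (1.f - ang);
    *w_hi = mag * ang;
}

int main(void)
{
    int lo, hi;
    float wlo, whi;
    const int cnbins = 9;             /* OpenCV's HOG default */
    const float angle_scale = cnbins / (2.0f * (float)M_PI);

    /* A 45-degree gradient of magnitude sqrt(2). */
    bin_gradient(1.0f, 1.0f, cnbins, angle_scale, &lo, &hi, &wlo, &whi);
    printf("bins %d/%d, weights %.3f/%.3f\n", lo, hi, wlo, whi);
    return 0;
}

The two weights always sum to the gradient magnitude, which is what the grad/qangle pair above encodes.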

Are you running a debug build of the CUDA code, maybe?

Both of them were tested in debug mode. These code segments are copied from the HOG object detection code in OpenCV. The difference between them when computing the histogram is huge: the CUDA version is about ten times slower than the OpenCL one. I think finding the reason will be hard work.

You need to time release builds. Code generated in debug mode (-g -O0) tends to be unoptimized and uses a lot more global memory access than necessary.

Generally speaking, performance comparisons using debug builds of an application are meaningless, on any platform.

BTW, the OpenCL and the CUDA code here do not seem to be functionally identical. Where the OpenCL code has sqrt the CUDA code has __fsqrt_rz. Why the use of __fsqrt_rz?

Thank you very much, njuffa. By compiling the .cu file with the option -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MD", simply choosing full optimization (/Ox), CUDA is faster than OpenCL now. As for why I use __fsqrt_rz: I assumed the compiler would map it to a machine instruction.

The intrinsics __fsqrt_r{n,z,u,d} are implementations of single-precision square root that provide results properly rounded according to one of the four rounding modes defined by the original IEEE 754 floating-point standard. The cost of all four variants is roughly the same.
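As an aside, the effect of a rounding mode on a correctly rounded square root can be seen on the host with C's fenv.h. This is a sketch using the CPU's sqrtf purely to illustrate directed rounding, not the GPU intrinsics themselves:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* sqrt(2) is inexact in binary floating point, so a correctly rounded
   square root differs in the last bit under different rounding modes,
   analogous to what __fsqrt_rn/_rz/_ru/_rd select on the GPU. */
int main(void)
{
    volatile float x = 2.0f;   /* volatile blocks compile-time folding */

    fesetround(FE_DOWNWARD);
    float down = sqrtf(x);     /* rounded toward -infinity */

    fesetround(FE_UPWARD);
    float up = sqrtf(x);       /* rounded toward +infinity */

    fesetround(FE_TONEAREST);  /* restore the default mode */

    printf("down = %.9g\nup   = %.9g\n", down, up);
    return 0;
}

On hardware whose sqrt honors the dynamic rounding mode, the two results differ by exactly one ulp.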

There is no machine instruction for square root in current NVIDIA GPUs. The closest you can get is to compile your code with -prec-sqrt=false, which causes single-precision square root sqrtf() to be approximate rather than correctly rounded. The approximate square root is based on the machine instructions MUFU.RCP and MUFU.RSQ, and possibly a few additional machine instructions, depending on whether the code is compiled with -ftz=true or -ftz=false.
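For intuition, an approximate square root can be built from a reciprocal-square-root estimate plus a refinement step, loosely mirroring the MUFU.RSQ-based approach njuffa describes. This host-side C sketch uses the classic bit-trick estimate purely as a stand-in for the GPU's hardware approximation:

#include <math.h>
#include <stdio.h>
#include <string.h>

/* Crude reciprocal-square-root estimate (the well-known bit trick),
   standing in for a hardware approximation like MUFU.RSQ. */
static float rsqrt_estimate(float x)
{
    unsigned i;
    float y;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y;
}

/* Approximate sqrt: one Newton-Raphson step refines the estimate of
   1/sqrt(x); multiplying by x then gives sqrt(x) = x * (1/sqrt(x)). */
static float sqrt_approx(float x)
{
    float y = rsqrt_estimate(x);
    y = y * (1.5f - 0.5f * x * y * y);
    return x * y;
}

int main(void)
{
    for (float x = 1.0f; x <= 4.0f; x += 1.0f)
        printf("sqrt_approx(%.0f) = %.6f (true %.6f)\n",
               x, sqrt_approx(x), sqrtf(x));
    return 0;
}

The result is accurate to a fraction of a percent, not correctly rounded; that accuracy/speed trade-off is exactly what -prec-sqrt=false opts into.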

A professional answer, thank you.