Comparison vector efficiency questions

I have the following kernel

template <int type>
void __global__ vecop_kernel(float * out, const float * lhs, const float * rhs, int len, bool * idx)
{
	const int off = threadIdx.x + blockDim.x * blockIdx.x;
	const int skip = gridDim.x * blockDim.x;
	for (int i=off; i<len; i+=skip)
	{
		if (type==0) idx[i] = (lhs[i]>rhs[i]);
		if (type==1) idx[i] = (lhs[i]<rhs[i]);
		if (type==2) idx[i] = (lhs[i]==rhs[i]);
	}
}

Which essentially computes one of

idx = lhs > rhs
idx = lhs < rhs
idx = lhs == rhs

where idx, lhs and rhs are all vectors.

Would there be a more efficient way to code the kernel?
Cheers.

I’m not sure how efficient the branch prediction is (if it even exists) on the CUDA cores. It’s worth doing a test where you transpose the if and the for loop:

void __global__ vecop_kernel(float * out, const float * lhs, const float * rhs, int len, bool * idx)
{
	const int off = threadIdx.x + blockDim.x * blockIdx.x;
	const int skip = gridDim.x * blockDim.x;

	if (type==0)
		for (int i=0; i<len; i+=skip)
			idx[i] = (lhs[i]>rhs[i]);
	else if (type==1)
		for (int i=0; i<len; i+=skip)
			idx[i] = (lhs[i]<rhs[i]);
	else if (type==2)
		for (int i=0; i<len; i+=skip)
			idx[i] = (lhs[i]==rhs[i]);
}

It might make no significant difference, since I suspect your kernel is memory-bandwidth limited anyway.
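
If you want to confirm that, a rough way is to time the kernel and compare the achieved bandwidth with your card's peak. A minimal sketch (the helper name and launch configuration are just placeholders, and out/lhs/rhs/idx are assumed to be device pointers that are already allocated):

#include <cstdio>

// Hedged sketch: time one launch of vecop_kernel<0> and report the achieved
// bandwidth. Each element moves two floats in and one bool out (9 bytes).
void report_bandwidth(float *out, const float *lhs, const float *rhs, int len, bool *idx)
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	vecop_kernel<0><<<256, 256>>>(out, lhs, rhs, len, idx);   // launch config is a guess
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	double bytes = (double)len * (2 * sizeof(float) + sizeof(bool));
	printf("effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));
}

If the number you get is already within striking distance of the card's theoretical bandwidth, there isn't much left to gain from restructuring the arithmetic.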

Sorry, I forgot to include the first line:

template <int type> :)
Should get optimized away. More importantly, I'm wondering how efficiently I can code the boolean writes in CUDA so that they coalesce properly :)

e.g. are there bool2s? bool4s? bool8s? bool32s?

Seibert, your kernel won’t work as written. You need the offset in the for loop.

for (int i=off; i< len; i+=skip) ...

Efficiency-wise, this kernel is entirely dominated by memory access. It should run close to full speed as-is.

You might be able to squeeze out some minor efficiency by making sure that your input and output arrays have suitably aligned base addresses, so that a warp reads a full cache line at once instead of splitting the access across two lines. This is likely already the case for memory from cudaMalloc(), but it can't hurt to double-check.
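
As for the bool2/bool4 question: there are no bool vector types, but char4/uchar4 do exist. If you want each thread to write a full 4-byte word, one option is to have every thread handle four elements, loading them as float4 and packing the four results into a uchar4. A sketch of the idea for the greater-than case (the kernel name is made up, and it assumes len is a multiple of 4 and the pointers are 16-byte aligned, which cudaMalloc() gives you):

__global__ void vecop_gt_packed(uchar4 *idx4, const float4 *lhs4, const float4 *rhs4, int len4)
{
	const int off = threadIdx.x + blockDim.x * blockIdx.x;
	const int skip = gridDim.x * blockDim.x;
	for (int i=off; i<len4; i+=skip)   // len4 = len/4
	{
		// One 16-byte load per operand and one 4-byte store per thread,
		// so both reads and writes stay fully coalesced.
		const float4 a = lhs4[i];
		const float4 b = rhs4[i];
		uchar4 r;
		r.x = (a.x > b.x);
		r.y = (a.y > b.y);
		r.z = (a.z > b.z);
		r.w = (a.w > b.w);
		idx4[i] = r;
	}
}

Whether this actually buys anything over plain byte stores would have to be measured; the per-byte writes within a warp should already coalesce into a small number of transactions.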

I would say that the code

if (type==0) idx[i] = (lhs[i]>rhs[i]);
if (type==1) idx[i] = (lhs[i]<rhs[i]);
if (type==2) idx[i] = (lhs[i]==rhs[i]);

is equivalent to

idx[i] = (type==0)*(lhs[i]>rhs[i])+(type==1)*(lhs[i]<rhs[i])+(type==2)*(lhs[i]==rhs[i]);

which does not involve any branching.
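
Dropped into the original grid-stride loop, that would look roughly like this (same signature as the original kernel):

template <int type>
__global__ void vecop_kernel(float *out, const float *lhs, const float *rhs, int len, bool *idx)
{
	const int off = threadIdx.x + blockDim.x * blockIdx.x;
	const int skip = gridDim.x * blockDim.x;
	for (int i=off; i<len; i+=skip)
	{
		// Exactly one of the three products can be non-zero for a given 'type'.
		idx[i] = (type==0)*(lhs[i]>rhs[i])
		       + (type==1)*(lhs[i]<rhs[i])
		       + (type==2)*(lhs[i]==rhs[i]);
	}
}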

yes, good approach!

Thanks, CudaaduC.

This approach could perhaps be preferable in the general case.

However, in this particular case, vecop_kernel is a template function, so for any instantiation of it the parameter type is known at compile time. In principle, the compiler could optimize each instantiation by keeping only the branch relevant to it. In that case, again in principle, the original poster's solution could paradoxically require less computation than the branch-free one. I wonder whether nvcc really does eliminate the dead branches in the template instantiations.
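
One way to check is to compile a single instantiation with nvcc -ptx (or dump the compiled binary with cuobjdump) and look at the generated code. What I would expect the compiler to produce for vecop_kernel<0> is effectively the following specialization, with the other two branches removed as dead code (this is a sketch of the expected result, not actual compiler output):

// Expected effective specialization of vecop_kernel<0>: since 'type' is a
// compile-time constant, the type==1 and type==2 branches are dead code.
__global__ void vecop_kernel_gt(float *out, const float *lhs, const float *rhs, int len, bool *idx)
{
	const int off = threadIdx.x + blockDim.x * blockIdx.x;
	const int skip = gridDim.x * blockDim.x;
	for (int i=off; i<len; i+=skip)
		idx[i] = (lhs[i]>rhs[i]);
}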