# Comparison vector efficiency questions

I have the following kernel

```
template <int type>
void __global__ vecop_kernel(float *out, const float *lhs, const float *rhs, int len, bool *idx)
{
    const int off = threadIdx.x + blockDim.x * blockIdx.x;
    const int skip = gridDim.x * blockDim.x;
    for (int i = off; i < len; i += skip)
    {
        if (type == 0) idx[i] = (lhs[i] > rhs[i]);
        if (type == 1) idx[i] = (lhs[i] < rhs[i]);
        if (type == 2) idx[i] = (lhs[i] == rhs[i]);
    }
}
```

Which essentially computes one of

```
idx = lhs > rhs
idx = lhs < rhs
idx = lhs == rhs
```

depending on `type`, where `idx`, `lhs`, and `rhs` are all vectors.

Would there be a more efficient way to code the kernel?
Cheers.

I’m not sure how efficient the branch prediction is (if it even exists) on the CUDA cores. It’s worth doing a test where you transpose the if and the for loop:

```
void __global__ vecop_kernel(float *out, const float *lhs, const float *rhs, int len, bool *idx)
{
    const int off = threadIdx.x + blockDim.x * blockIdx.x;
    const int skip = gridDim.x * blockDim.x;

    if (type == 0)
        for (int i = 0; i < len; i += skip)
            idx[i] = (lhs[i] > rhs[i]);
    else if (type == 1)
        for (int i = 0; i < len; i += skip)
            idx[i] = (lhs[i] < rhs[i]);
    else if (type == 2)
        for (int i = 0; i < len; i += skip)
            idx[i] = (lhs[i] == rhs[i]);
}
```

It might make no significant difference, since I suspect your kernel is memory-bandwidth limited anyway.
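A host-side sketch of the same transposition in plain C++ (hypothetical name `vecop_hoisted`, with `type` as a runtime argument purely for illustration): the branch is taken once per call instead of once per element.

```cpp
#include <cassert>

// Host-side analogue of hoisting the branch out of the loop: the
// comparison is selected once, then each loop body is branch-free.
void vecop_hoisted(int type, bool *idx, const float *lhs, const float *rhs, int len)
{
    if (type == 0)
        for (int i = 0; i < len; ++i) idx[i] = (lhs[i] > rhs[i]);
    else if (type == 1)
        for (int i = 0; i < len; ++i) idx[i] = (lhs[i] < rhs[i]);
    else
        for (int i = 0; i < len; ++i) idx[i] = (lhs[i] == rhs[i]);
}
```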

Sorry, I forgot to include the first line:

`template <int type>` :)
It should get optimized away. More importantly, I'm wondering how efficiently I can code the boolean writes in CUDA so that they coalesce properly :)

E.g. are there `bool2`s? `bool4`s? `bool8`s? `bool32`s?
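For what it's worth, CUDA defines vector types like `char4` and `float4` but no `bool2`/`bool4` variants. One common trick is to have each warp collapse its 32 predicates into a single 32-bit word (on the device, `__ballot_sync()` does exactly this) so the warp issues one coalesced 4-byte store instead of 32 one-byte stores. A host-side sketch of the packing arithmetic (hypothetical helper `pack_gt`):

```cpp
#include <cassert>
#include <cstdint>

// Host-side sketch of the packing idea: instead of one byte per bool,
// collapse 32 comparison results into a single 32-bit word. On the GPU a
// warp could produce the same word with __ballot_sync() and issue one
// coalesced 4-byte store; here we just loop over the 32 "lanes".
uint32_t pack_gt(const float *lhs, const float *rhs)
{
    uint32_t word = 0;
    for (int lane = 0; lane < 32; ++lane)
        if (lhs[lane] > rhs[lane])
            word |= 1u << lane;   // lane's predicate becomes bit `lane`
    return word;
}
```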

Seibert, your kernel won’t work as written. You need the offset in the for loop.

```
for (int i = off; i < len; i += skip) ...
```

Efficiency-wise, this kernel is entirely dominated by memory access. It should run at close to full speed as is.

You might be able to squeeze out some minor efficiency by making sure that your input and output arrays have base addresses aligned so that a warp reads a full cache line at once instead of splitting it across two. This is likely already the case with pointers returned by cudaMalloc(), but it can't hurt to double-check.
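A quick host-side alignment check (hypothetical helper; assumes a 128-byte cache line, which matches current NVIDIA GPUs):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Check that a pointer sits on a 128-byte boundary (one full cache line
// on current NVIDIA GPUs). cudaMalloc() returns allocations aligned at
// least this much, but offset views into a buffer may not be.
bool is_cacheline_aligned(const void *p)
{
    return reinterpret_cast<uintptr_t>(p) % 128 == 0;
}
```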

I would say that the code

```
if (type == 0) idx[i] = (lhs[i] > rhs[i]);
if (type == 1) idx[i] = (lhs[i] < rhs[i]);
if (type == 2) idx[i] = (lhs[i] == rhs[i]);
```

is equivalent to

```
idx[i] = (type==0)*(lhs[i]>rhs[i])
       + (type==1)*(lhs[i]<rhs[i])
       + (type==2)*(lhs[i]==rhs[i]);
```

which does not involve any branching.
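A sketch checking the equivalence in plain C++ (hypothetical names `branchy`/`branchless`): since exactly one of the three `type` comparisons evaluates to 1 and the others to 0, the sum picks out the matching predicate.

```cpp
#include <cassert>

// Branchy form: one comparison selected by if-statements.
bool branchy(int type, float l, float r)
{
    if (type == 0) return l > r;
    if (type == 1) return l < r;
    return l == r;
}

// Branchless form: each (type==k) term is 0 or 1, so the sum equals the
// single comparison selected by `type`.
bool branchless(int type, float l, float r)
{
    return (type == 0) * (l > r) + (type == 1) * (l < r) + (type == 2) * (l == r);
}
```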

Yes, good approach!