CUDA array loop: differences between two implementations

Hello everyone!

I’d like to know the differences between the following two implementations for iterating over an array,
considering both kernels are launched like this:

array<<<blocks, threads, 0, 0>>>(input, tab_size);

Implem1

static __global__ void array(
    int* input,
    const unsigned int input_size)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

    // grid-stride loop: each thread may process several elements
    while (index < input_size)
    {
        input[index] = 1; // apply something on the array
        index += blockDim.x * gridDim.x;
    }
}

Implem2

static __global__ void array(
    int* input,
    const unsigned int input_size)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

    // one element per thread: only works if the grid covers the whole array
    if (index < input_size)
    {
        input[index] = 1; // apply something on the array
    }
}

I already know that both work, but I’d like to know whether one of the implementations is more efficient than the other and, more importantly, why. I did a few benchmarks, but the results seem quite similar.
Since the application I’m developing is very resource-intensive, every performance improvement (even a small one) matters.
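To make the comparison concrete, here is a rough host-side sketch of how each version constrains the launch configuration (the values of threads and blocks below are illustrative, not my actual ones):

    const unsigned int threads = 256;

    // Implem2: the grid MUST cover the whole array, one thread per element
    unsigned int blocks = (tab_size + threads - 1) / threads;
    array<<<blocks, threads, 0, 0>>>(input, tab_size);

    // Implem1: any grid size is correct; the grid-stride loop makes the
    // same threads walk over whatever elements remain
    array<<<64, threads, 0, 0>>>(input, tab_size);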

Thanks,
Kawa

This link may be relevant.
[url]https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/[/url]
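If I remember correctly, the article’s main point is that the grid-stride version (your Implem1) lets you size the grid from the hardware rather than from the array length. A sketch along the lines of the article (devId being whichever device you run on):

    int devId = 0;   // assuming the default device
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, devId);

    // enough blocks to keep every SM busy, independent of tab_size
    array<<<32 * numSMs, 256, 0, 0>>>(input, tab_size);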

Well, thanks for sharing this article; it answers the question pretty well!

Sincerely,
Kawa