# Loop in kernel

In a kernel I perform 8 checks with if statements.

If, instead of writing 8 ifs, I write a for loop (with 8 iterations) that does the same thing, the loop version is much slower.

Why?

Thanks.

I have no idea what you are asking. Can you post code for the two versions you are comparing?

Sure…

This is the “If version”

    __global__ void life(float *A, int N, float *mod) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int celleVive = 0;

        if (col != N-1 && A[row * N + col+1] == 1) {
            celleVive++;
        }
        if (col != 0 && A[row * N + col-1] == 1) {
            celleVive++;
        }
        if (row != N-1 && A[(row+1) * N + col] == 1) {
            celleVive++;
        }
        if (row != 0 && A[(row-1) * N + col] == 1) {
            celleVive++;
        }
        if (row != 0 && col != 0 && A[(row-1) * N + col-1] == 1) {
            celleVive++;
        }
        if (row != 0 && col != N-1 && A[(row-1) * N + col+1] == 1) {
            celleVive++;
        }
        if (row != N-1 && col != 0 && A[(row+1) * N + col-1] == 1) {
            celleVive++;
        }
        if (row != N-1 && col != N-1 && A[(row+1) * N + col+1] == 1) {
            celleVive++;
        }

        if (A[row * N + col] == 0) { // dead cell
            if (celleVive == 3) {
                //A[row * N + col] = 1;
                mod[row * N + col] = 1;
            }
        } else { // live cell
            if (celleVive < 2 || celleVive > 3) {
                //A[row * N + col] = 0;
                mod[row * N + col] = 0;
            }
        }
    }

and this is the “loop version”

    __global__ void life(float *Ac, int N, float *mod, int *vr, int *vc, int numVicini) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int celleVive = 0;

        for (int i = 0; i < numVicini; i++) { // numVicini = 8
            if (row+vr[i] < N && row+vr[i] >= 0 && col+vc[i] < N && col+vc[i] >= 0) {
                if (Ac[(row+vr[i]) * N + col+vc[i]] == 1) {
                    celleVive++;
                }
            }
        }

        if (Ac[row * N + col] == 0) { // dead cell
            if (celleVive == 3) {
                mod[row * N + col] = 1;
            }
        } else { // live cell
            if (celleVive < 2 || celleVive > 3) {
                mod[row * N + col] = 0;
            }
        }
    }

This is a simple implementation of Conway’s Game of Life. The arrays vr and vc contain the row and column offsets of the neighbours.

The version with loop is slower than the version without loop. Why?

Thanks

Because you have all the extra memory accesses to vr and vc.

What compute capability is your device? You can place vr and vc in constant memory to take advantage of the constant cache. This will make less of a difference on 2.x devices though, where all memory reads are cached anyway.
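For reference, the constant-memory suggestion could look roughly like the sketch below. The names c_vr, c_vc, h_vr, h_vc, NUM_VICINI and uploadOffsets are made up for illustration, and the early return for out-of-range threads is an addition that the kernels posted in this thread do not have. In each loop iteration all threads of a warp read the same table element, which is the access pattern the constant cache handles well.

```cuda
#include <cuda_runtime.h>

#define NUM_VICINI 8  // hypothetical constant: number of neighbours

// Neighbour offsets in constant memory (hypothetical names).
__constant__ int c_vr[NUM_VICINI];
__constant__ int c_vc[NUM_VICINI];

// Host-side copies: row and column offsets of the 8 neighbours.
static const int h_vr[NUM_VICINI] = {-1, -1, -1,  0, 0,  1, 1, 1};
static const int h_vc[NUM_VICINI] = {-1,  0,  1, -1, 1, -1, 0, 1};

// Copy both offset tables to the device once, before the first launch.
cudaError_t uploadOffsets() {
    cudaError_t err = cudaMemcpyToSymbol(c_vr, h_vr, sizeof(h_vr));
    if (err != cudaSuccess) return err;
    return cudaMemcpyToSymbol(c_vc, h_vc, sizeof(h_vc));
}

__global__ void life(const float *Ac, int N, float *mod) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= N || col >= N) return;  // guard added for this sketch

    int celleVive = 0;
    for (int i = 0; i < NUM_VICINI; i++) {
        int r = row + c_vr[i];  // offsets come from constant memory
        int c = col + c_vc[i];
        if (r >= 0 && r < N && c >= 0 && c < N && Ac[r * N + c] == 1) {
            celleVive++;
        }
    }

    if (Ac[row * N + col] == 0) {  // dead cell
        if (celleVive == 3) mod[row * N + col] = 1;
    } else {                       // live cell
        if (celleVive < 2 || celleVive > 3) mod[row * N + col] = 0;
    }
}
```

Call uploadOffsets() once on the host and check its return value before launching the kernel.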

Or, better yet, eliminate the lookups completely. When unrolled this just reproduces your version with the 8 ifs.

    __global__ void life(float *Ac, int N, float *mod) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int celleVive = 0;

        #pragma unroll
        for (int i = 0; i < 9; i++) {
            int x = col-1 + i mod 3;
            int y = row-1 + i / 3;
            if (i != 4 && x < N && x >= 0 && y < N && y >= 0) {
                if (Ac[y * N + x] == 1) {
                    celleVive++;
                }
            }
        }

        if (Ac[row * N + col] == 0) { // dead cell
            if (celleVive == 3) {
                mod[row * N + col] = 1;
            }
        } else { // live cell
            if (celleVive < 2 || celleVive > 3) {
                mod[row * N + col] = 0;
            }
        }
    }
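To illustrate why this loop needs no lookup tables: for i = 0..8, the remainder i % 3 and the quotient i / 3 walk the 3x3 block around the cell, and i == 4 is the cell itself. A small sketch of that mapping, with hypothetical helper names (neighbourDx, neighbourDy, isNeighbour) and macro guards so it also compiles with a plain C++ compiler:

```cuda
// Let the helpers compile both under nvcc and under a plain C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Column offset of the i-th cell of the 3x3 block: -1, 0, 1, -1, 0, 1, ...
__host__ __device__ inline int neighbourDx(int i) { return i % 3 - 1; }

// Row offset of the i-th cell of the 3x3 block: -1, -1, -1, 0, 0, 0, 1, 1, 1
__host__ __device__ inline int neighbourDy(int i) { return i / 3 - 1; }

// i == 4 maps to offset (0, 0), i.e. the cell itself, so it is skipped.
__host__ __device__ inline bool isNeighbour(int i) { return i != 4; }
```

In the kernel, x = col + neighbourDx(i) and y = row + neighbourDy(i) then visit the same eight neighbours as the version with the 8 ifs.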

I’ve placed the arrays in constant memory and performance has improved, but not by much.

With the #pragma unroll I receive this warning: “Advisory: Loop was not unrolled, unexpected control flow construct”.

Thank you very much for the help.

Weird. Anyway, you can use this version:

    __global__ void life(float *Ac, int N, float *mod) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int celleVive = 0;

        #pragma unroll
        for (int i = 0; i < 9; i++) {
            int x = col-1 + i % 3;
            int y = row-1 + i / 3;
            if (i != 4) {
                if (x < N && x >= 0) {
                    if (y < N && y >= 0) {
                        if (Ac[y * N + x] == 1) {
                            celleVive++;
                        }
                    }
                }
            }
        }

        if (Ac[row * N + col] == 0) { // dead cell
            if (celleVive == 3) {
                mod[row * N + col] = 1;
            }
        } else { // live cell
            if (celleVive < 2 || celleVive > 3) {
                mod[row * N + col] = 0;
            }
        }
    }

And sorry for writing mod instead of % (I was writing VHDL in a different window), but you’ve apparently noticed that yourself.