In a kernel I perform 8 checks with if statements.
If I replace the 8 ifs with a for loop (8 iterations) that does the same thing, the loop version is much slower.
Why?
Thanks.
I have no idea what you are asking. Can you post code for the two versions you are comparing?
Sure…
This is the “If version”
__global__ void life(float *A, int N, float *mod) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int celleVive = 0;  // number of live neighbours
    // Right and left neighbours
    if (col != N-1 && A[row * N + col+1] == 1) {
        celleVive++;
    }
    if (col != 0 && A[row * N + col-1] == 1) {
        celleVive++;
    }
    // Neighbours below and above
    if (row != N-1 && A[(row+1) * N + col] == 1) {
        celleVive++;
    }
    if (row != 0 && A[(row-1) * N + col] == 1) {
        celleVive++;
    }
    // Diagonal neighbours
    if (row != 0 && col != 0 && A[(row-1) * N + col-1] == 1) {
        celleVive++;
    }
    if (row != 0 && col != N-1 && A[(row-1) * N + col+1] == 1) {
        celleVive++;
    }
    if (row != N-1 && col != 0 && A[(row+1) * N + col-1] == 1) {
        celleVive++;
    }
    if (row != N-1 && col != N-1 && A[(row+1) * N + col+1] == 1) {
        celleVive++;
    }
    if (A[row * N + col] == 0) {  // dead cell
        if (celleVive == 3) {
            //A[row * N + col] = 1;
            mod[row * N + col] = 1;
        }
    } else {  // live cell
        if (celleVive < 2 || celleVive > 3) {
            //A[row * N + col] = 0;
            mod[row * N + col] = 0;
        }
    }
}
and this is the “loop version”
__global__ void life(float *Ac, int N, float *mod, int *vr, int *vc, int numVicini) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int celleVive = 0;  // number of live neighbours
    for (int i = 0; i < numVicini; i++) {  // numVicini = 8
        if (row+vr[i] < N && row+vr[i] >= 0 && col+vc[i] < N && col+vc[i] >= 0) {
            if (Ac[(row+vr[i]) * N + col+vc[i]] == 1) {
                celleVive++;
            }
        }
    }
    if (Ac[row * N + col] == 0) {  // dead cell
        if (celleVive == 3) {
            mod[row * N + col] = 1;
        }
    } else {  // live cell
        if (celleVive < 2 || celleVive > 3) {
            mod[row * N + col] = 0;
        }
    }
}
It is a simple implementation of Conway’s Game of Life. The arrays vr and vc contain the row and column offsets of the neighbours.
The version with the loop is slower than the version without it. Why?
Thanks
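For readers following along: the post does not show what vr and vc hold. A plausible layout, assumed here, is one row offset and one column offset per neighbour. The sketch below is plain host-side C++ rather than a kernel, and the names `kVr`, `kVc`, and `countNeighboursTable` are hypothetical. It mirrors the loop above and marks the two extra table reads that each iteration performs before it touches the grid.

```cpp
#include <vector>

// Assumed contents of vr / vc: one (row, column) offset per neighbour.
// The original post does not show them, so this layout is a guess.
static const int kVr[8] = {-1, -1, -1,  0, 0,  1, 1, 1};
static const int kVc[8] = {-1,  0,  1, -1, 1, -1, 0, 1};

// Host-side mirror of the loop version: counts the live neighbours of
// (row, col) in an N x N row-major grid. Cells outside the grid count as dead.
int countNeighboursTable(const std::vector<float>& A, int N, int row, int col) {
    int celleVive = 0;
    for (int i = 0; i < 8; i++) {
        int r = row + kVr[i];  // extra read from the row-offset table
        int c = col + kVc[i];  // extra read from the column-offset table
        if (r >= 0 && r < N && c >= 0 && c < N && A[r * N + c] == 1.0f) {
            celleVive++;
        }
    }
    return celleVive;
}
```

In the kernel, the same two reads per iteration go to device memory for every thread, which the version with 8 hard-coded ifs never has to do.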
Because you have all the extra memory accesses to vr and vc.
What compute capability is your device? You can place vr and vc in constant memory to take advantage of the constant cache. This will make less of a difference on 2.x devices though, where all memory reads are cached anyway.
Or, better yet, eliminate the lookups completely. When unrolled this just reproduces your version with the 8 ifs.
__global__ void life(float *Ac, int N, float *mod) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int celleVive = 0;  // number of live neighbours
    #pragma unroll
    for (int i = 0; i < 9; i++) {
        int x = col-1 + i mod 3;
        int y = row-1 + i / 3;
        if (i != 4 && x < N && x >= 0 && y < N && y >= 0) {
            if (Ac[y * N + x] == 1) {
                celleVive++;
            }
        }
    }
    if (Ac[row * N + col] == 0) {  // dead cell
        if (celleVive == 3) {
            mod[row * N + col] = 1;
        }
    } else {  // live cell
        if (celleVive < 2 || celleVive > 3) {
            mod[row * N + col] = 0;
        }
    }
}
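The index arithmetic in this loop can be checked on the host. The sketch below is plain C++, and the helper name `neighbourOffsets` is hypothetical. It lists the offsets that the remainder `i % 3` and the quotient `i / 3` produce for i = 0..8, skipping i == 4, which is the cell itself.

```cpp
#include <utility>
#include <vector>

// Returns the (dx, dy) offsets visited by a 9-iteration loop over the 3x3
// block around a cell, skipping the centre (i == 4).
std::vector<std::pair<int, int>> neighbourOffsets() {
    std::vector<std::pair<int, int>> offsets;
    for (int i = 0; i < 9; i++) {
        if (i == 4) continue;  // i == 4 maps to (0, 0), the cell itself
        int dx = i % 3 - 1;    // column offset cycles through -1, 0, 1
        int dy = i / 3 - 1;    // row offset steps through -1, 0, 1
        offsets.push_back({dx, dy});
    }
    return offsets;
}
```

Because the loop bound is a compile-time constant, every `dx` and `dy` becomes a constant once the loop is unrolled, so no offset table has to be read at run time.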
I’ve placed the arrays in constant memory, and performance improved, but not by much.
With the #pragma unroll I receive this warning: “Advisory: Loop was not unrolled, unexpected control flow construct”.
Thank you very much for your help.
Weird. Anyway, you can use this version:
__global__ void life(float *Ac, int N, float *mod) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int celleVive = 0;  // number of live neighbours
    #pragma unroll
    for (int i = 0; i < 9; i++) {
        int x = col-1 + i % 3;
        int y = row-1 + i / 3;
        // Nested ifs instead of one && chain
        if (i != 4) {
            if (x < N && x >= 0) {
                if (y < N && y >= 0) {
                    if (Ac[y * N + x] == 1) {
                        celleVive++;
                    }
                }
            }
        }
    }
    if (Ac[row * N + col] == 0) {  // dead cell
        if (celleVive == 3) {
            mod[row * N + col] = 1;
        }
    } else {  // live cell
        if (celleVive < 2 || celleVive > 3) {
            mod[row * N + col] = 0;
        }
    }
}
And sorry for writing mod instead of % (I was writing VHDL in a different window), but you’ve apparently noticed that yourself.
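When comparing kernel variants like the ones in this thread, a host-side reference makes it easier to confirm that they all produce the same grid. The sketch below is plain C++, and the function name `lifeStepReference` is hypothetical. It applies the same rule as the kernels to an N x N row-major grid. It assumes the output starts as a copy of the input, because the kernels write to mod only when a cell changes state. The thread does not show how mod is initialised.

```cpp
#include <vector>

// Host-side reference for one Game of Life step on an N x N row-major grid.
// Cells outside the grid count as dead, matching the kernels' bounds checks.
// Assumption: the output starts as a copy of the input, since the kernels
// only write cells whose state changes.
std::vector<float> lifeStepReference(const std::vector<float>& A, int N) {
    std::vector<float> mod = A;
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            int celleVive = 0;  // number of live neighbours
            for (int i = 0; i < 9; i++) {
                int x = col - 1 + i % 3;
                int y = row - 1 + i / 3;
                if (i != 4 && x >= 0 && x < N && y >= 0 && y < N &&
                    A[y * N + x] == 1.0f) {
                    celleVive++;
                }
            }
            if (A[row * N + col] == 0.0f) {  // dead cell
                if (celleVive == 3) mod[row * N + col] = 1.0f;
            } else {                         // live cell
                if (celleVive < 2 || celleVive > 3) mod[row * N + col] = 0.0f;
            }
        }
    }
    return mod;
}
```

The result can be compared element by element with the mod array copied back from the device after one kernel launch.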