Small question about bank conflicts

Hi,

I am very confused about bank conflicts.

I implemented a CUDA kernel using shared memory, but the kernel shows bank conflicts (I am not observing any performance increase).
I read the optimization chapter of the CUDA Programming Guide about shared memory, and I don't understand why the kernel uses the "s" (stride) variable to avoid bank conflicts (which cause the performance decrease).
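
For reference, the pattern I mean is the reduction example from the guide; this is a rough sketch from memory (not my kernel, and assuming 256 threads per block):

__global__ void reduce_sketch(const float *in, float *out)
{
    __shared__ float sdata[256];

    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    /* The naive version strides upward, e.g.
     *     sdata[2*s*tid] += sdata[2*s*tid + s];
     * so the threads of a warp hit the same banks (2-way, 4-way, ...
     * conflicts as s grows). The version below strides downward:
     * thread tid always reads sdata[tid] and sdata[tid + s],
     * consecutive addresses across the warp, one bank per thread. */
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}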

Below is my kernel:

__global__ void calc_ez_share(float *ez, float *ga, float *dz, int dimx, int dimy){

/*

SUB_LARGURA is the block width; it is also the width of the shared-memory tile that is loaded from global memory on each iteration.

*/

    __shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];
    __shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];

    int bx = blockIdx.x;
    int by = blockIdx.y;

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;

    int idx = iy*dimx + ix;

    int Row = by * SUB_LARGURA + ty;
    int Col = bx * SUB_LARGURA + tx;

    int m = 0;

    float value = 0;

    if((ix!=0)&&(iy!=0)&&(ix!=dimx-1)&&(iy!=dimy-1)){

            for (m = 0; m < TAM/SUB_LARGURA; m++) {

                    /* stage one tile of ga and dz in shared memory */
                    ga_s[ty][tx] = ga[Row*TAM + (m*SUB_LARGURA + tx)];
                    dz_s[ty][tx] = dz[Row*TAM + (m*SUB_LARGURA + tx)];

                    /* note: this barrier sits inside a divergent branch
                       (the border threads skip it), which is unsafe */
                    __syncthreads();

                    ez[Row*TAM + (m*SUB_LARGURA + tx)] = ga_s[ty][tx] * dz_s[ty][tx];
            }
    }
    __syncthreads();
}

The code is as slow as the kernel implemented with global memory.
The shared-memory parallelism is not happening.
What is the right way?

best regards

__shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];

__shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];

....

if((ix!=0)&&(iy!=0)&&(ix!=dimx-1)&&(iy!=dimy-1)){

    for (m = 0; m < TAM/SUB_LARGURA; m++) {

        ga_s[ty][tx] = ga[Row*TAM + (m*SUB_LARGURA + tx)];

        dz_s[ty][tx] = dz[Row*TAM + (m*SUB_LARGURA + tx)];

        __syncthreads();

        ez[Row*TAM + (m*SUB_LARGURA + tx)] = ga_s[ty][tx] * dz_s[ty][tx];

    }

}

You don't re-use data, so the shared memory is redundant. Try the following code:

//__shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];

//__shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];

....

if((ix!=0)&&(iy!=0)&&(ix!=dimx-1)&&(iy!=dimy-1)){

    for (m = 0; m < TAM/SUB_LARGURA; m++) {

        /* plain registers instead of shared memory */
        float a = ga[Row*TAM + (m*SUB_LARGURA + tx)];

        float b = dz[Row*TAM + (m*SUB_LARGURA + tx)];

        ez[Row*TAM + (m*SUB_LARGURA + tx)] = a * b;

    }

}

Hi,

thanks :)

But using the code you posted, the execution time is the same as the global memory implementation.

Another question… when you declare:

float a = ga[Row*TAM + (m*SUB_LARGURA + tx)];

you are using each thread's registers, correct?

So the performance penalty is not connected with bank conflicts… right?

This code (that I am using) is a fragment of an FDTD implementation that simulates electromagnetic waves.

All the articles use shared memory the way my code does and report a performance increase.

Where is the bug?

Any idea?

best regards

Yes.

I didn't look at the FDTD code, but the key is data reuse.

In general, a finite difference method uses coefficients/data from the neighbors. So you need to

load a block (2-D or 3-D) into shared memory before you do the standard stencil; see the sketch below.

You need to check this part.
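
A minimal sketch of that pattern (hypothetical names: BDIM, u, out; assuming a 16x16 block): the block stages its tile plus a one-cell halo in shared memory, then applies a 5-point stencil.

#define BDIM 16

__global__ void stencil_tiled(const float *u, float *out, int dimx, int dimy)
{
    /* tile plus a one-cell halo on every side */
    __shared__ float u_s[BDIM + 2][BDIM + 2];

    int ix = blockIdx.x * BDIM + threadIdx.x;
    int iy = blockIdx.y * BDIM + threadIdx.y;
    int tx = threadIdx.x + 1;   /* +1: row/column 0 holds the halo */
    int ty = threadIdx.y + 1;

    if (ix < dimx && iy < dimy) {
        u_s[ty][tx] = u[iy * dimx + ix];
        /* edge threads of the block also fetch the halo cells */
        if (threadIdx.x == 0 && ix > 0)
            u_s[ty][0] = u[iy * dimx + ix - 1];
        if (threadIdx.x == BDIM - 1 && ix < dimx - 1)
            u_s[ty][BDIM + 1] = u[iy * dimx + ix + 1];
        if (threadIdx.y == 0 && iy > 0)
            u_s[0][tx] = u[(iy - 1) * dimx + ix];
        if (threadIdx.y == BDIM - 1 && iy < dimy - 1)
            u_s[BDIM + 1][tx] = u[(iy + 1) * dimx + ix];
    }
    __syncthreads();

    /* each staged value is read by up to four threads of the block,
       so one global load is reused several times */
    if (ix > 0 && iy > 0 && ix < dimx - 1 && iy < dimy - 1)
        out[iy * dimx + ix] = 0.25f * (u_s[ty][tx - 1] + u_s[ty][tx + 1]
                                     + u_s[ty - 1][tx] + u_s[ty + 1][tx]);
}

That reuse is the whole point: in your ez kernel each value is read exactly once, so there is nothing to amortize against the extra shared-memory traffic.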

It's true… the FDTD uses the neighbors… here is an example:

/*

This kernel calculates the density and needs the neighbors.

It is a global memory kernel implementation.

*/

__global__ void calc_dz(float *dz, float *hx, float *hy, float pulse, int dimx, int dimy){

	int ix = blockIdx.x*blockDim.x + threadIdx.x;

	int iy = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = iy*dimx + ix;

	/* the update uses the left neighbor of hy and the top neighbor of hx */
	if((ix!=0)&&(iy!=0)&&(ix!=dimx-1)&&(iy!=dimy-1))

		dz[idx] = dz[idx] + 0.5*(hy[idx] - hy[iy*dimx+(ix-1)] - hx[idx] + hx[(iy-1)*dimx+ix]);

	__syncthreads();

	/* inject the source pulse at the grid center */
	dz[(dimy*dimx+dimx)/2] = pulse;

}

The kernel above is responsible for calculating the magnetic density (it uses the neighbors).

But to measure the shared-memory performance increase I chose the simplest kernel… the electric field kernel… for which the neighbors are not necessary.

The problem is… the electric field calculation uses only a simple multiplication… and I don't understand why people obtain a performance increase if data reuse is impossible… so I supposed: "the optimization comes only from transferring the data to shared memory, and my code has bank conflicts". But your code demonstrates the opposite.

Is there another way to get this (a performance increase using shared memory, without data reuse)? Because the articles my code is based on use the same kernel shown in the first post and say "there is a performance increase".

best regards

You can use cuobjdump to check the assembly code. Then you will see that shared memory is extra cost if you don't reuse data.
In other words, if you don't reuse data, then using shared memory increases the instruction count.
As for the speedup due to shared memory, you need to specify a NUMBER. 10% speedup?
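
For example (file names are hypothetical; adjust -arch to your GPU):

nvcc -arch=sm_20 -cubin fdtd.cu -o fdtd.cubin
cuobjdump -sass fdtd.cubin

In the SASS, look at the shared-memory store/load instructions the shared version adds on top of the plain global loads.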

OK, I will try cuobjdump… I want around a 30% speedup; 70% without data reuse would be perfect.

But I notice that if I use shared memory without reuse, the "for ()" loop used to load the data from global memory increases the cost in comparison with the global memory kernel.

The code is at home… I will try again later…

Do you have an email?

best regards

Good morning,

I sent the questions to the article's main author, who wrote the code.
I am waiting for the reply, and then I will post the code…

best regards