Result of bank conflict testing is weird!

Hi, all,

I recently ran some experiments to evaluate performance with and without bank conflicts. However, the run times look weird. Please take a look at my code:

const int gridsize  = 1;
const int blocksize = 256;
const int N = gridsize * blocksize;

__global__ void no_bk(int iter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int3 test[N];
    int3 temp = test[idx];
}

The code above is not supposed to have any bank conflicts, because the stride is 3, as shown in the CUDA Programming Guide 3.0 (Fig. G-2).

__global__ void bk(int iter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int2 test[N];
    int2 temp = test[idx];
}

The code above is supposed to have bank conflicts, because the stride is 2, as shown in the CUDA Programming Guide 3.0 (Fig. G-2).

However, the measured run times are:

no bk: 0.117569 seconds

with bk: 0.000013 seconds


How can this be possible? Or is something wrong in my code?

Thanks


My GPU is compute capability 1.1.

This is how I measure the time:


dim3 block(blocksize);
dim3 grid(gridsize);

cutCreateTimer(&timer);
cutStartTimer(timer);

no_bk<<<grid, block>>>(internal_iters);
cudaThreadSynchronize();

cutStopTimer(timer);
printf("no bk: %f seconds\n", cutGetTimerValue(timer) / 1000);
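One caveat about this harness (an assumption on my part, not something confirmed in the thread): the very first kernel launch in a process also pays the one-time CUDA context/module initialization cost, which could easily account for a ~0.1 s reading. A common fix is a warm-up launch before the timed one, sketched here with the same cutil timer API:

```cuda
// Warm-up launch: absorbs one-time context/module initialization
// so it is not counted in the measurement.
no_bk<<<grid, block>>>(internal_iters);
cudaThreadSynchronize();

cutResetTimer(timer);
cutStartTimer(timer);

no_bk<<<grid, block>>>(internal_iters);   // the launch actually timed
cudaThreadSynchronize();

cutStopTimer(timer);
printf("no bk: %f seconds\n", cutGetTimerValue(timer) / 1000);
```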

No code is generated by nvcc, because you don’t output any data.

You can use cuobjdump to check this.

Both of those kernels will compile to empty functions because of compiler optimization: temp is never used, so the loads are dead code. I don’t know what time difference you are measuring, but it certainly has nothing to do with the code of either kernel.
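If you want the shared-memory reads to survive optimization, one option (a sketch, not tested on this hardware) is a loop-carried index chase: each load produces the address of the next, so the compiler can neither delete nor hoist the loads, and only a single global store is needed at the end.

```cuda
__global__ void no_bk_chase(int iter, int *output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int3 test[256];                 // N = 256, as in the original

    test[idx] = make_int3(idx, idx, idx);      // element i holds i, so the
    __syncthreads();                           // chase below stays at idx

    int j = idx;
    for (int i = 0; i < iter; ++i)
        j = test[j].x;                         // stride-3 shared read each
                                               // iteration; feeds next address
    output[idx] = j;                           // one store keeps the loop alive
}
```

Because each thread keeps reading its own element, the stride-3 access pattern is repeated iter times while the global-memory traffic stays constant.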

Thank you both. But the problem is that if I add an output to these kernel functions, there will be a memory-coalescing violation, which will affect the accuracy of the bank-conflict measurement. Do you have an example that exclusively measures the performance of bank conflicts, without the influence of other issues (like memory coalescing)?

Thanks a lot

Deryk.

The following is my code:

__global__ void no_bk(int iter, int3 *output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int3 test1[N];
    for (int i = 0; i < iter; ++i)
        output[idx] = test1[idx];
}

__global__ void bk(int iter, int2 *output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int2 test2[N];
    for (int i = 0; i < iter; ++i)
        output[idx] = test2[idx];
}

Result is:

Starting GPU test v1 …

no bk: 0.020932 seconds

with bk: 0.001793 seconds

Still weird: performance with bank conflicts is better than without. Here N is 256.

Thanks.
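One thing worth noting about the kernels above: the loop body output[idx] = test1[idx] does not depend on i, so the compiler is free to collapse the whole loop into a single load and store, in which case iter has no effect on what is measured. A sketch of a version whose loop cannot be collapsed (illustrative only, untested on compute-capability 1.1 hardware) makes each iteration depend on the previous one and stores once at the end:

```cuda
__global__ void bk_chase(int iter, int *output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int2 test2[256];            // N = 256, as above

    test2[idx] = make_int2(idx, idx);      // element i holds i
    __syncthreads();

    int j = idx;
    for (int i = 0; i < iter; ++i)
        j = test2[j].x;                    // stride-2 shared read every
                                           // iteration: 2-way bank conflict
    output[idx] = j;                       // single store; result is used,
                                           // so the loop survives optimization
}
```

With the global store outside the loop, the shared-memory access pattern dominates the run time for large iter, so coalescing no longer masks the bank-conflict effect.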