I recently ran some experiments to evaluate the performance with and without bank conflicts. However, the run times I measured are strange; please take a look at my code:
const int gridsize = 1;
const int blocksize = 256;
const int N = gridsize * blocksize;

__global__ void no_bk(int iter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int3 test[N];
    int3 temp = test[idx];
}
There should be no bank conflicts in the code above, because the stride is 3 words, as shown in the CUDA Programming Guide 3.0 (Fig. G-2).
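For what it's worth, here is a small host-side sanity check of that claim (my own sketch, not from the guide), assuming the 16 banks / half-warp access of compute 1.x devices; the same argument holds for 32 banks, since gcd(3, 16) = gcd(3, 32) = 1:

#include <cstdio>

int main()
{
    const int BANKS  = 16;   // banks per half-warp on compute 1.x hardware
    const int STRIDE = 3;    // an int3 element occupies 3 consecutive 32-bit words
    // an int3 load is issued as 3 separate word loads; check each of them
    for (int k = 0; k < STRIDE; ++k)
    {
        int hits[BANKS] = {0};
        for (int t = 0; t < 16; ++t)                 // the 16 threads of a half-warp
            hits[(t * STRIDE + k) % BANKS]++;
        int worst = 0;
        for (int b = 0; b < BANKS; ++b)
            worst = hits[b] > worst ? hits[b] : worst;
        printf("word %d of int3: at most %d thread(s) per bank\n", k, worst);
    }
    return 0;   // prints "at most 1 thread(s) per bank" for each word, i.e. conflict-free
}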
Both of those kernels will compile to empty functions because of compiler optimization. I don't know what time difference you are measuring, but it certainly has nothing to do with the code of either kernel.
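For instance (a minimal sketch of my own, assuming you add a global output pointer out and keep the same N as above), the shared-memory load only survives if its result is observable outside the kernel:

__global__ void no_bk(int3 *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int3 test[N];
    test[idx] = make_int3(idx, idx, idx);   // give the shared array defined contents
    __syncthreads();
    int3 temp = test[idx];
    out[idx] = temp;                        // the result is consumed, so the load is kept
}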
Thank you both. But the problem is that if I add an output to these kernel functions, the global stores are uncoalesced, which will affect the accuracy of measuring bank conflicts. Do you have an example that measures the cost of bank conflicts exclusively, without the influence of other issues (like memory coalescing)?
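One possible way to isolate the effect (just a sketch, not a verified benchmark; bank_test, out, and BLOCKSIZE are illustrative names) is to do all of the timed work in shared memory inside a loop and write only a single int per thread at the end, so the global store is coalesced and negligible compared to the shared-memory traffic. The stride is a template parameter, so the same kernel can be run conflict-free (odd stride) or with conflicts (even stride):

const int BLOCKSIZE = 256;

template <int STRIDE>
__global__ void bank_test(int *out, int iter)
{
    __shared__ int buf[BLOCKSIZE * STRIDE];
    int tid = threadIdx.x;

    // fill shared memory once; this part is not what we want to measure
    for (int i = tid; i < BLOCKSIZE * STRIDE; i += blockDim.x)
        buf[i] = i;
    __syncthreads();

    // repeated strided shared-memory loads; the "+ i" keeps the address
    // loop-dependent so the load cannot be hoisted out of the loop
    int acc = 0;
    for (int i = 0; i < iter; ++i)
        acc += buf[(tid * STRIDE + i) % (BLOCKSIZE * STRIDE)];

    // one 4-byte store per thread, fully coalesced, so it barely shows up
    out[blockIdx.x * blockDim.x + tid] = acc;
}

Launching, say, bank_test<1><<<1, BLOCKSIZE>>>(d_out, 1000000) versus bank_test<8><<<1, BLOCKSIZE>>>(d_out, 1000000) and timing each with cudaEvent_t should then show a difference that comes almost entirely from bank conflicts. Treat it as a starting point rather than a finished benchmark.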