is There a Bank conflict in this kernel? using a __shared__ array

hello,

im having a little of trouble with this kernel.

first of all this is my struct which i work with.

typedef struct{

	float3 p1;

	float3 p2;

	float3 p3;

	int T[3];

}pTriangle;

and this is my kernel, you can see a shared array of “pTriangles” the size of MYCUDA_BLOCK_SIZE_COLISSION which is 128. and block size is 128, so each thread reads its corresponding index.

this code makes each thread to write data into the “iBlock” position of the shared array that will be used in the next part of the kernel that i havent put here.

__global__ void kernelDetectarInterseccionesShared(Vertex* vbo, GLuint* eab, GLuint numCaras, float vecindad){

	int j = blockIdx.x * blockDim.x;

	int iBlock = threadIdx.x;

	//array shared

	__shared__ pTriangle sharedTriangles[MYCUDA_BLOCK_SIZE_COLISSION];

	if( vecindad <= 0.0f ){ return; }

	if( i<numCaras && j < numCaras ){

		if( j+iBlock < numCaras ){

			sharedTriangles[iBlock].T[0] = eab[faceSize*(j+iBlock) + 0];

			sharedTriangles[iBlock].T[1] = eab[faceSize*(j+iBlock) + 1];

			sharedTriangles[iBlock].T[2] = eab[faceSize*(j+iBlock) + 2];

			sharedTriangles[iBlock].p1 = make_float3(vbo[ sharedTriangles[iBlock].T[0] ].x, vbo[ sharedTriangles[iBlock].T[0] ].y, vbo[ sharedTriangles[iBlock].T[0] ].z );

			sharedTriangles[iBlock].p2 = make_float3(vbo[ sharedTriangles[iBlock].T[1] ].x, vbo[ sharedTriangles[iBlock].T[1] ].y, vbo[ sharedTriangles[iBlock].T[1] ].z );

			sharedTriangles[iBlock].p3 = make_float3(vbo[ sharedTriangles[iBlock].T[2] ].x, vbo[ sharedTriangles[iBlock].T[2] ].y, vbo[ sharedTriangles[iBlock].T[2] ].z );

		}

	 }

}

the code is taking 20ms for a grid of (157,157) where each block size of 128, this is only part of the kernel, if i put the rest i get i timeout which i could post later if you guys tell me that this part doesnt have any problem.

could i still have conflicts because of the size of the struct or any other reason?

thanks in advance

i tested the kernel with Cuda Profiler and im getting

warp serialize = 9,000,000

which 6 Million come from the reads of Global memory (eab and vbo ) this ones may be inevitable but i wont read them again
and 3 Million come from writing to the shared array. this is important

do i need to change my struct structure to avoid bank conflicts?

To my understanding:
size of struct 12 32bit values and GCD(12,32) = 4 so yes.
adding a dummy 32bit would make GCD(13,32) = 1, no bank conflicts.

Should be easy to verify/reject.

Another problem which may severly reduce the performance of your kernel is global memory read.

Memory is best read if treads of the same half-warp read 32-bit words which are next to each other. (devices with CC 1.0 or 1.1 have stronger constraints on this). In your case however threads read 32-bit words which are faceSize*4 bytes apart. That is why I would suggest keeping your data in a “struct of arrays” instead of an “array of structs”. Consider the following:

template<int size>

struct pTriangleArrays{

	float p1x;

	float p1y;

	float p1z;

	float p2x;

	float p2y;

	float p2z;

	float p3x;

	float p3y;

	float p3z;

	int T0;

	int T1;

	int T2;

};

__global__ void kernelDetectarInterseccionesShared(float* vbo[3], GLuint* eab[3], GLuint numCaras, float vecindad){

	int j = blockIdx.x * blockDim.x;

	int iBlock = threadIdx.x;

	//array shared

	__shared__ pTriangleArrays<MYCUDA_BLOCK_SIZE_COLISSION> sharedTriangles;

	if( vecindad <= 0.0f ){ return; }

	if( i<numCaras && j < numCaras ){

		if( j+iBlock < numCaras ){

			sharedTriangles.T0[iBlock] = eab[0][j+iBlock]; //(*)

			sharedTriangles.T1[iBlock] = eab[1][j+iBlock];

			sharedTriangles.T2[iBlock] = eab[2][j+iBlock];

			sharedTriangles.p1x[iBlock] = vbo[0][sharedTriangles.T0[iBlock]]; //(**)

			sharedTriangles.p1y[iBlock] = vbo[0][sharedTriangles.T0[iBlock]];

			 [...]

		}

	 }

}

If you are using device with compute capability at least 1.2 line (*) global memory read will be perfectly or nearly-perfectly coalesced and it will introduce no bank conflicts at all.

Line () may be also by nearly-perfectly coalesced provided values stored at sharedTriangles.T0[iBlock] are simillar for nearby threads (difference not greater than 16 or so). You might want to use texture fetches here. Nevertheless, () will introduce no bank conflicts in shared memory either. With this organisation, I believe you should get data about 2 to 4 times faster, but don’t trust my word on this too much :)

For a device with compute capability 1.1 or 1.0, j must be a multiply of some number X, where X is - I believe - 16 or 32 or something like that. If you are using such device, you may get a boost up of reading up to 16 times faster :)