Hello,

I'm having a little trouble with this kernel.

First of all, this is the struct I work with:

```
typedef struct{
    float3 p1;
    float3 p2;
    float3 p3;
    int T[3];
}pTriangle;
```

And this is my kernel. You can see a shared array of “pTriangles” of size MYCUDA_BLOCK_SIZE_COLISSION, which is 128. The block size is also 128, so each thread works on its own index.

This code makes each thread write data into the “iBlock” position of the shared array, which will be used in the next part of the kernel that I haven't included here.

```
__global__ void kernelDetectarInterseccionesShared(Vertex* vbo, GLuint* eab, GLuint numCaras, float vecindad){
    // index of the first face loaded by this block
    int j = blockIdx.x * blockDim.x;
    // this thread's slot in the shared array
    int iBlock = threadIdx.x;
    // shared array, one pTriangle per thread of the block
    __shared__ pTriangle sharedTriangles[MYCUDA_BLOCK_SIZE_COLISSION];
    if( vecindad <= 0.0f ){ return; }
    // note: i and faceSize are not declared in this excerpt
    if( i<numCaras && j < numCaras ){
        if( j+iBlock < numCaras ){
            // vertex indices of face j+iBlock
            sharedTriangles[iBlock].T[0] = eab[faceSize*(j+iBlock) + 0];
            sharedTriangles[iBlock].T[1] = eab[faceSize*(j+iBlock) + 1];
            sharedTriangles[iBlock].T[2] = eab[faceSize*(j+iBlock) + 2];
            // vertex positions looked up through those indices
            sharedTriangles[iBlock].p1 = make_float3(vbo[ sharedTriangles[iBlock].T[0] ].x, vbo[ sharedTriangles[iBlock].T[0] ].y, vbo[ sharedTriangles[iBlock].T[0] ].z );
            sharedTriangles[iBlock].p2 = make_float3(vbo[ sharedTriangles[iBlock].T[1] ].x, vbo[ sharedTriangles[iBlock].T[1] ].y, vbo[ sharedTriangles[iBlock].T[1] ].z );
            sharedTriangles[iBlock].p3 = make_float3(vbo[ sharedTriangles[iBlock].T[2] ].x, vbo[ sharedTriangles[iBlock].T[2] ].y, vbo[ sharedTriangles[iBlock].T[2] ].z );
        }
    }
}
```

The code takes 20 ms for a grid of (157,157) with a block size of 128. This is only part of the kernel; if I add the rest I get a timeout. I could post the rest later if you tell me that this part doesn't have any problem.

Could I still have bank conflicts because of the size of the struct, or for any other reason?

Thanks in advance.