Filling data into linear memory

Hi, I am new to CUDA.

I am trying to write triangles into an array one after another, with no gaps between them.

struct Vertex
{
    float3 pos;
    float3 normal;
};

__device__ int globalIndex = 0;

__global__ void simpleKernel(Vertex *output)
{
	// struct Triangle { Vertex vertex1, vertex2, vertex3; };
	// __shared__ Triangle tmp[100]; // using 100 threads per block

	// Flat index of this thread within its block (stays unique for 2D blocks):
	// int writeIndex = threadIdx.y * blockDim.x + threadIdx.x;

	// --- Produce triangle ---
	// If no triangle is created, the thread exits here.
	// tmp[writeIndex].vertex1 = ...; // save the triangle's vertices
	// ------------------------

	// __syncthreads();

	// Reserve three consecutive output slots; atomicAdd returns the old value.
	// int globalWriteIndex = atomicAdd(&globalIndex, 3);

	// Write to output:
	// output[globalWriteIndex + 0] = tmp[writeIndex].vertex1;
	// output[globalWriteIndex + 1] = tmp[writeIndex].vertex2;
	// output[globalWriteIndex + 2] = tmp[writeIndex].vertex3;
}

Is this possible? The output array should hold the triangles in the correct form (vertices in CW/CCW order). I cannot compute globalWriteIndex from the thread and block indices, because that leaves empty slots between triangles whenever a thread does not create a triangle.

Thanks
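
For reference, here is a minimal sketch of how the atomicAdd idea from the pseudocode above could look as compilable code. It is only one possible approach, not a definitive implementation. The names makeTriangle, compactKernel and runCompaction are hypothetical, and makeTriangle is a made-up stand-in for the real triangle generation (it simply produces a triangle for every even candidate). The sketch keeps each triangle in registers instead of shared memory, since every thread only reads back its own triangle, and it omits CUDA error checking for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

struct Vertex
{
    float3 pos;
    float3 normal;
};

// Next free slot in the output array, counted in vertices.
__device__ int globalIndex = 0;

// Hypothetical stand-in for the real triangle generation:
// only even-numbered candidates produce a triangle.
__device__ bool makeTriangle(int id, Vertex tri[3])
{
    if (id % 2 != 0) return false;
    for (int k = 0; k < 3; ++k) {
        tri[k].pos    = make_float3((float)id, (float)k, 0.0f);
        tri[k].normal = make_float3(0.0f, 0.0f, 1.0f);
    }
    return true;
}

__global__ void compactKernel(Vertex *output, int numCandidates)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= numCandidates) return;

    Vertex tri[3];
    if (!makeTriangle(id, tri)) return;  // nothing reserved, so no gap is left

    // Reserve three consecutive slots; atomicAdd returns the old value.
    int base = atomicAdd(&globalIndex, 3);
    output[base + 0] = tri[0];
    output[base + 1] = tri[1];
    output[base + 2] = tri[2];
}

// Hypothetical host helper: runs the kernel and returns the packed vertices.
std::vector<Vertex> runCompaction(int numCandidates)
{
    if (numCandidates <= 0) return {};

    // Allocate for the worst case: every candidate yields a triangle.
    Vertex *dOut = nullptr;
    cudaMalloc(&dOut, sizeof(Vertex) * 3 * numCandidates);

    // Reset the counter before every launch.
    int zero = 0;
    cudaMemcpyToSymbol(globalIndex, &zero, sizeof(int));

    int threads = 128;
    int blocks  = (numCandidates + threads - 1) / threads;
    compactKernel<<<blocks, threads>>>(dOut, numCandidates);
    cudaDeviceSynchronize();

    // The final counter value is the number of vertices written.
    int count = 0;
    cudaMemcpyFromSymbol(&count, globalIndex, sizeof(int));
    assert(count % 3 == 0);

    std::vector<Vertex> result(count);
    cudaMemcpy(result.data(), dOut, sizeof(Vertex) * count, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return result;
}
```

With this layout the three vertices of one triangle always end up next to each other and in the order they were written, so the winding is preserved. The order of the triangles themselves depends on which thread reaches atomicAdd first and can change from run to run. If a single global counter turns out to be a bottleneck, a prefix sum (stream compaction) over per-thread flags is the usual alternative.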