I would do it the following way:
- each thread stores in shared memory how many elements it wants to write (T[threadIdx.x] := amount_of_data).
- you perform an inclusive prefix sum (prefix scan) on array T. As a result each cell of the array holds the sum of all elements up to and including its own. There are efficient algorithms for that, google it or even search this forum :)
- the last cell of the array then holds N, the total number of elements to be stored by the whole block.
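- atomically increment the global index by N (prev := atomicAdd(ptr, N)). Only one thread of the block should do this and then share prev with the others, e.g. through shared memory.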
- Now each thread may safely store its data in cells prev+T[threadIdx.x-1] … prev+T[threadIdx.x]-1 (for thread 0, take T[-1] as 0).
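
The steps above could be sketched roughly like this. Treat it as an illustration under my own assumptions, not a tuned implementation: the block size BLOCK, the made-up workload (thread g writes g % 4 copies of its own id), the kernel name reserve_and_write and the host helper run_demo are all hypothetical. The scan is a naive Hillis-Steele one, in real code you would probably use a library scan instead.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 128  // assumed threads per block; the launch must use exactly this

// out: global output buffer, counter: global index shared by all blocks.
__global__ void reserve_and_write(int *out, unsigned int *counter, int n_threads)
{
    __shared__ unsigned int T[BLOCK];
    __shared__ unsigned int prev;

    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK + tid;
    // Hypothetical workload: thread gid wants to write (gid % 4) elements.
    unsigned int amount = (gid < n_threads) ? gid % 4 : 0;

    // Each thread stores how many elements it wants to write.
    T[tid] = amount;
    __syncthreads();

    // Naive inclusive scan: afterwards T[i] = sum of amounts of threads 0..i.
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        unsigned int v = (tid >= offset) ? T[tid - offset] : 0;
        __syncthreads();
        T[tid] += v;
        __syncthreads();
    }

    // The last cell holds N; one thread reserves N cells in the global buffer.
    if (tid == BLOCK - 1) prev = atomicAdd(counter, T[BLOCK - 1]);
    __syncthreads();

    // T[tid] - amount equals T[tid-1], and is 0 for thread 0.
    unsigned int begin = prev + T[tid] - amount;
    for (unsigned int k = 0; k < amount; ++k) out[begin + k] = gid;
}

// Host helper (hypothetical): runs the kernel, copies the result into h_out
// and returns the total number of elements written.
unsigned int run_demo(int n_threads, int *h_out, int capacity)
{
    int *d_out;
    unsigned int *d_counter;
    cudaMalloc(&d_out, capacity * sizeof(int));
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    int blocks = (n_threads + BLOCK - 1) / BLOCK;
    reserve_and_write<<<blocks, BLOCK>>>(d_out, d_counter, n_threads);

    unsigned int total = 0;
    cudaMemcpy(&total, d_counter, sizeof(total), cudaMemcpyDeviceToHost);
    assert(total <= (unsigned int)capacity);
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_counter);
    return total;
}
```

Note that no thread returns early: threads past n_threads just request 0 elements, so everybody still reaches every __syncthreads(). The order of blocks in the output depends on which block gets to the atomicAdd first.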
If the order of the data is not significant for you and one thread may store its variable-length data at various positions (not necessarily one after another), you might want to consider using the reserved memory of size N differently, so that the writes are better coalesced.
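
One possible way to get coalesced writes, which is not necessarily the scheme meant above, is to stage the block's data in shared memory first and then let the whole block copy the N elements out cooperatively. A rough sketch, assuming a known upper bound MAX_PER_THREAD on the elements per thread (my assumption, as are BLOCK, the g % 4 workload and the names staged_write and run_staged):

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 128          // assumed threads per block
#define MAX_PER_THREAD 3   // assumed upper bound on elements per thread

__global__ void staged_write(int *out, unsigned int *counter, int n_threads)
{
    __shared__ unsigned int T[BLOCK];
    __shared__ int stage[BLOCK * MAX_PER_THREAD];
    __shared__ unsigned int prev;

    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK + tid;
    // Hypothetical workload: thread gid produces (gid % 4) elements.
    unsigned int amount = (gid < n_threads) ? gid % (MAX_PER_THREAD + 1) : 0;

    T[tid] = amount;
    __syncthreads();

    // Naive inclusive scan, same as in the basic scheme.
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        unsigned int v = (tid >= offset) ? T[tid - offset] : 0;
        __syncthreads();
        T[tid] += v;
        __syncthreads();
    }

    unsigned int N = T[BLOCK - 1];
    if (tid == 0) prev = atomicAdd(counter, N);

    // Each thread puts its data into the shared staging buffer
    // at its block-local offset.
    unsigned int begin = T[tid] - amount;
    for (unsigned int k = 0; k < amount; ++k) stage[begin + k] = gid;
    __syncthreads();

    // Cooperative copy: in each iteration consecutive threads write
    // consecutive global addresses.
    for (unsigned int i = tid; i < N; i += BLOCK) out[prev + i] = stage[i];
}

// Host helper (hypothetical): returns the total number of elements written.
unsigned int run_staged(int n_threads, int *h_out, int capacity)
{
    int *d_out;
    unsigned int *d_counter;
    cudaMalloc(&d_out, capacity * sizeof(int));
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    int blocks = (n_threads + BLOCK - 1) / BLOCK;
    staged_write<<<blocks, BLOCK>>>(d_out, d_counter, n_threads);

    unsigned int total = 0;
    cudaMemcpy(&total, d_counter, sizeof(total), cudaMemcpyDeviceToHost);
    assert(total <= (unsigned int)capacity);
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_counter);
    return total;
}
```

The price is the extra shared memory for the staging buffer, so this only works when the per-thread amount has a reasonably small bound.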