Algorithm to clean/shring array

Hi everyone,
I am working on something and stumbled across a bottle neck in my code and after thinking for hours, I cannot come up with a better approach. Probably someone here can help :)
OK, I have an array of structs. Each struct contains a variable that basically says if that particular struct is valid or not. I need to “remove” all invalid structs from that array. As it is a plain memory array (no list), I have to move them around.
My first approach was to just start a single kernel which goes though all the structs with two offsets, one for read and one for write. The read offset is incremented in every iteration. If the current read struct is valid it is copied to the write offset and the write offset is incremented. Very simple, but inefficient. This approach takes about 2 ms on my MX250, when I have 4096 struct of 72 bytes each (the number is not always a power of two!). Even if the algorithm would have to move every single struct, the data rate is a lousy 150 MB/s.
So, basically my question is: How can I do that more efficiently (in parallel)?

Edit: The order of the result does not matter.

Without changing the order of elements this is called stream compaction. It is implemented in Thrust or CUB .

It is not clear to me if your single kernel only uses a single thread or not since you ask for a parallel version.

If the order of elements is not important, you can use a kernel with 1 thread per input element.
Each thread checks whether the element should be kept or not. If it should be kept, atomically increment a global counter for the output index and store the element at that position.

This can be improved by reducing the number of atomic operations in global memory by counting the number of selected elements per block.

Ohh, i meant thread, not kernel. So I have only a single thread, which is of course slow, but I did not expected it to be that slow.
I think I got the idea using the global atomically counter. But what do you mean with counting the number of selected elements per block? Does that mean to use two kernels, one which first counts the number of valid objects per block and then calculate the offsets for each block and run a second kernel to move the values?
I am currently trying to do that with thrust::remove_if, but I guess it might not be a good solution as I don’t care about the order. So remove_if will have to move more data around which is not required.

OK, I finally got thrust::remove_if working. Works a bit faster, I am down to ~400us (I think I was wrong about the 2 ms, I think it was 20 ms). However, the remove_if stuff is still wasting ~50% of the total time to malloc/free some memory on the GPU. I don’t think this is required for my particular case. So I am still looking for a better approach.
I thought a bit about your suggestion. I think this can be done better, because in that algorithm every single struct has to be moved. I am quite sure that most of the time almost all of the structs are valid and need to be kept. So theoretically I would just need to move a few from the back to the front. Any suggestions?