I am working on something and stumbled across a bottleneck in my code. After thinking about it for hours, I cannot come up with a better approach. Maybe someone here can help :)
OK, I have an array of structs. Each struct contains a variable that says whether that particular struct is valid or not. I need to “remove” all invalid structs from that array. Since it is a plain array in memory (not a list), I have to move the structs around.
My first approach was to just start a single kernel which goes through all the structs with two offsets, one for reading and one for writing. The read offset is incremented in every iteration. If the struct at the read offset is valid, it is copied to the write offset and the write offset is incremented. Very simple, but inefficient. This approach takes about 2 ms on my MX250 when I have 4096 structs of 72 bytes each (the number is not always a power of two!). Even if the algorithm had to move every single struct, that would be only about 295 KB in 2 ms, a lousy 150 MB/s.
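To make the approach above concrete, here is a minimal sketch of the two-offset loop, written as plain host-side C++ rather than as my actual kernel. The struct name `Item`, its `valid` field, the 68-byte `payload` and the function name `compact_in_place` are placeholders I made up for illustration; the real layout is different.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 72-byte record; field names and layout are assumptions.
struct Item {
    std::uint32_t valid;       // non-zero means the record should be kept
    std::uint8_t payload[68];  // placeholder for the remaining 68 bytes
};
static_assert(sizeof(Item) == 72, "Item is meant to mirror the 72-byte struct");

// Two-offset compaction: `read` visits every element, `write` marks the
// next free slot. Returns the number of valid records, which end up in
// items[0 .. result).
std::size_t compact_in_place(Item* items, std::size_t count) {
    std::size_t write = 0;
    for (std::size_t read = 0; read < count; ++read) {
        if (items[read].valid != 0) {
            if (write != read) {
                items[write] = items[read];  // move the valid record forward
            }
            ++write;
        }
    }
    return write;
}
```

Calling `compact_in_place(items, 4096)` leaves the valid structs at the front of the array and returns how many there are. Every iteration depends on the `write` value left by the previous one, which is why this version runs serially.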
So, basically, my question is: how can I do this more efficiently (in parallel)?
Edit: The order of the result does not matter.