Well, if it is 20 times faster, you can run the kernel once to calculate the count values for each block, then use prefix-sum to convert them to the indexes you need and use them in the second real kernel call and you are still 10 times faster (even assuming you have to do the full work in both kernels, which is unlikely).
It might also make sense to e.g. use a small but constant “count” for the first kernel and already store those that fit and only re-run the blocks that did have parts that did not fit the second time.
Indeed, I also have a kernel where each thread outputs between 0 and (total_amount_of_threads/2) values. I also do first calculate how many outputs will be needed -> num_out array.
Then I scan that array to get a new array base_index array.
Then I have the threads actually perform the output num_out times, starting at base_index.
It takes 3 kernel calls in total, but it uses only 1 extra array, and still performance is really good.