Optimizing a workload with a large amount of computation but a small amount of results

So this might be more of a theoretical parallel optimization problem than a CUDA-specific problem per se. I'm very new to parallel programming in general, so this may just be personal ignorance.

So I have a workload that consists of a 64-bit binary number on which I run an analysis. If the analysis completes, that binary number is a valid solution. If the analysis breaks midway, it's invalid. The end goal is to get a list of all the valid solutions.

Now there are many trillions of 64-bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (e.g. every consecutive 1000 numbers are valid, and then every random billion or so are invalid). I can't find a pattern to the spacing between bunches, so I can't skip over the large runs of invalid solutions.

Currently, every thread in a kernel call analyzes just one number. If the number is valid, the thread marks it as such in its respective slot in a device array; ditto if it's invalid. So basically I generate a data point for every value analyzed, regardless of whether it's valid or not. Then once the array is full, I copy it to the host, but only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the number of threads in the grid.
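In case it helps, here is a minimal sketch of what I mean (names like `is_valid` are placeholders for the actual analysis, which I've omitted):

```cuda
// One thread per candidate value; each thread writes a 0/1 flag into its
// own slot, so a data point is produced whether the value is valid or not.
__device__ bool is_valid(unsigned long long x);  // the analysis (omitted)

__global__ void analyze(unsigned long long start, int *flags, int *found)
{
    unsigned long long i = blockIdx.x * (unsigned long long)blockDim.x
                         + threadIdx.x;
    bool ok = is_valid(start + i);
    flags[i] = ok ? 1 : 0;   // result recorded regardless of validity
    if (ok)
        *found = 1;          // device-side flag: host copies only if set
}
```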

But copying memory to and from the GPU is expensive time-wise. So what I would like to do is copy data over only when necessary: fill up a device array with only the valid solutions, and then, once the array is full, copy it to the host. But how do you consecutively fill an array in a parallel environment? Or am I approaching this problem the wrong way?

I'm pretty sure what you're describing is a form of filtering/reduction operation called stream compaction.

This blog:

https://devblogs.nvidia.com/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/

briefly touches on the basic idea, then fairly quickly dives into a heavily optimized method (warp-aggregated atomics).

More generally, this is easy to do with either atomics or a library like CUB or Thrust.
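The simplest atomics-based version looks something like this (a sketch, not tuned code; `is_valid` again stands in for your analysis): each thread that finds a valid value claims the next free slot in the output array with `atomicAdd`, so the output array ends up densely packed with only valid solutions.

```cuda
// Naive stream compaction with a global atomic counter. After the kernel,
// *count tells the host how many valid values were found, so only that
// many elements need to be copied back.
__device__ bool is_valid(unsigned long long x);  // your analysis (omitted)

__global__ void compact(unsigned long long start,
                        unsigned long long *out,
                        unsigned int *count)
{
    unsigned long long v = start
        + blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    if (is_valid(v)) {
        unsigned int slot = atomicAdd(count, 1u);  // reserve an output slot
        out[slot] = v;                             // write only valid results
    }
}
```

Note the output order is nondeterministic, which shouldn't matter if you just want the list of valid solutions. With your ~5% hit rate, contention on the single counter is usually tolerable, and the warp-aggregated technique in the blog post above (or `thrust::copy_if` / CUB's `DeviceSelect`) reduces it further.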