Suppose I have a large dataset of size N in global memory, and a kernel that I want to run repeatedly, once per frame. The problem is that the kernel does not need to process the entire dataset every frame (the maximum it needs to process is about N/2, and the minimum could even be zero), and which items need processing depends on the output of the previous frame. At the moment I can come up with two options:
- Launch over the entire N dataset every frame, with a simple if statement in the kernel that checks whether the current item should be processed (does this mean a lot of threads will sit idle every frame?)
- Build a schedule array each frame, appending the indices into N that need processing (this requires atomic adds), and then launch the kernel over just the schedule array.
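For concreteness, here is a minimal CUDA sketch of the two options. It assumes a per-item `active` flag written by the previous frame; `Item` and `process()` are hypothetical placeholders standing in for the real data and per-item work:

```cuda
// Placeholders for the real data layout and per-item work (assumptions).
struct Item { float payload; };
__device__ void process(Item* it) { /* nontrivial work here */ }

// Option 1: one thread per item in N, predicated on an active flag.
// Inactive threads exit almost immediately after the check.
__global__ void processAll(Item* data, const bool* active, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && active[i])
        process(&data[i]);
}

// Option 2a: build a compacted schedule of active indices each frame.
// `count` must be zeroed before this launch; atomicAdd assigns slots
// in a nondeterministic order.
__global__ void buildSchedule(const bool* active, int n,
                              int* schedule, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && active[i]) {
        int slot = atomicAdd(count, 1);
        schedule[slot] = i;
    }
}

// Option 2b: launch exactly `count` threads over the schedule.
__global__ void processSchedule(Item* data, const int* schedule, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        process(&data[schedule[i]]);
}
```

This is only a sketch of the trade-off as described above: option 1 pays a cheap predicate per inactive item but launches N threads regardless, while option 2 pays the atomics and an extra kernel (or a scan-based compaction) to launch only as many threads as there are active items.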
Given that the work done on each item in N is nontrivial, does anyone have a recommendation on which I should choose, or a third option? Bear in mind that the items needing processing each frame do not form a contiguous block.
I guess another way of asking the question is: if you have a large dataset, what is the most efficient way to run a kernel over a sparse subset of it?