I am writing an algorithm whose tasks must execute in a specific order. To work within this restriction, I have been launching my kernel with 1 block of X threads. Since that leaves me with fewer threads than vector elements, each thread loops inside the kernel, incrementing a counter, until every element has been covered.
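To make the setup concrete, here is a simplified sketch of the single-block launch. The kernel name `singleBlockKernel`, the vector size, the value of X (256), and the doubling operation are all placeholders for illustration, not my real code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: launched with ONE block. Each thread starts at its own
// index and advances a counter by the block size, so X threads cover n elements.
__global__ void singleBlockKernel(float *data, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        data[i] = 2.0f * data[i];  // stand-in for the real per-element work
    }
}

int main()
{
    const int n = 1 << 20;    // assumed vector length
    const int threads = 256;  // "X" threads in the single block (assumed value)

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 1 block, X threads: every thread loops over its share of the vector.
    singleBlockKernel<<<1, threads>>>(d_data, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

With a single block, only one multiprocessor is doing any work, which is presumably why this version is slow.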
In some tests, I found that if I increase the number of blocks to allow 1 thread per element, the computation time is cut nearly in half (unsurprisingly). Is there any way to make the blocks execute sequentially, in a fixed order?
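For comparison, the faster multi-block launch looks roughly like the sketch below. Again, `perElementKernel`, the sizes, and the doubling operation are placeholders. As far as I understand, CUDA makes no guarantee about the order in which these blocks are scheduled, which is exactly the problem for my algorithm.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: one thread per element, no loop needed.
__global__ void perElementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) {                                    // guard the last, partial block
        data[i] = 2.0f * data[i];                   // stand-in for the real work
    }
}

int main()
{
    const int n = 1 << 20;    // assumed vector length
    const int threads = 256;  // threads per block (assumed value)
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Many blocks: the hardware scheduler decides which block runs when.
    perElementKernel<<<blocks, threads>>>(d_data, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```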