This is for a class project, and we’ve deliberately constrained the problem to make it harder, to see whether the harder version can still be implemented well on a GPU.
We’re implementing the Smith-Waterman algorithm for local sequence alignment, which compares a DNA sequence against a large database of known sequences. If I were doing this for any other purpose, the most natural partitioning would be to distribute the known sequences across blocks and do each entire sequence-sequence match within one block, which wouldn’t require any block-to-block communication. Alternatively, we could pipeline the sequences through blocks. However, since this is an exercise, we’ve decided to see how fast we can do a single sequence-sequence compare on the GPU, and we want to do so across multiple blocks to maximize parallelism across the SMs.
The algorithm basically generates a scoring matrix between the two sequences being compared. Each element of the matrix depends on the values of the three elements above, to the left, and above-left of it. If we partitioned this matrix into 4 quadrants, we could perform the calculation in 3 kernel calls. The 1st call would be 1 block and would calculate the upper-left quadrant. The 2nd call would be 2 blocks (upper-right and lower-left), each of which depends on values along the upper-left quadrant’s edges. The 3rd call would be 1 block to calculate the lower-right quadrant, which depends on the lower-left and upper-right quadrants.
We only need 2 SMs for this case, since our largest kernel has only 2 blocks in it. Suppose the 1st kernel’s block landed on SM0. When it is done computing (all of the computing is done in shared memory), it has to write the upper-left quadrant’s boundary elements to global memory so the 2 blocks in kernel 2 can access them. Now, in the 2nd kernel call, one of the two blocks may land on the same SM as the 1st kernel’s block. That block has to read the upper-left quadrant’s elements back from global memory to re-establish its faster shared-memory copy. Rather than saving shared-memory state to global memory at the end of the 1st kernel and then re-establishing shared memory from global memory for the 2nd kernel, we think it would be better to write just enough boundary elements to global memory to let another block satisfy its dependencies, while continuing to work on the next quadrant out of our already-resident shared-memory image.
With this approach, we assign the upper-left and upper-right quadrants to block0 and the lower-left and lower-right quadrants to block1. When block0 finishes the upper-left quadrant, it writes its lower-edge elements to global memory and continues working on the upper-right quadrant (without having to write/read its right edge to/from global memory). Block1 sees block0’s lower-edge elements become valid and starts working on the lower-left quadrant. Similarly, when block0 finishes its upper-right quadrant, it writes those results to global memory, block1 sees the edge elements are valid, and it completes the lower-right quadrant.
Basically, we’re just trying to optimize out the shared->global state save and the global->shared state restore that multiple kernel calls would require. Does any of this make sense?