I need to traverse a very large array of characters with a sliding window of length w, advancing by just 1 character at a time. I have been at this for a while and still cannot figure out an access pattern that gives the lowest latency when using many threads, each of which handles one window of w characters.
I pack the characters, since one word can hold 4 characters in a single unit.
ABCD DCBA ABDC DABC…
For example, with w = 5, Thread 0 handles ABCDD and Thread 1 handles BCDDC.
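
To make the layout concrete, here is a minimal CPU-side sketch of the indexing only, not a tuned kernel. It assumes 8-bit characters packed four per 32-bit word with the first character in the lowest byte; the helper names `pack_chars`, `char_at`, and `window_for_thread` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: pack 8-bit characters four per 32-bit word.
// Assumption: the first character of each group sits in the lowest byte;
// unused bytes in the last word are left as zero.
std::vector<std::uint32_t> pack_chars(const std::string& text) {
    std::vector<std::uint32_t> words((text.size() + 3) / 4, 0u);
    for (std::size_t i = 0; i < text.size(); ++i) {
        const auto byte = static_cast<std::uint32_t>(
            static_cast<unsigned char>(text[i]));
        words[i / 4] |= byte << (8 * (i % 4));
    }
    return words;
}

// Hypothetical helper: read the character at global position pos.
char char_at(const std::vector<std::uint32_t>& words, std::size_t pos) {
    assert(pos / 4 < words.size());  // position must lie inside the array
    return static_cast<char>((words[pos / 4] >> (8 * (pos % 4))) & 0xFFu);
}

// Window handled by thread tid: w characters starting at character tid.
std::string window_for_thread(const std::vector<std::uint32_t>& words,
                              std::size_t tid, std::size_t w) {
    std::string out;
    out.reserve(w);
    for (std::size_t k = 0; k < w; ++k) {
        out.push_back(char_at(words, tid + k));
    }
    return out;
}
```

With the input `ABCDDCBAABDCDABC` and w = 5, `window_for_thread(words, 0, 5)` yields `ABCDD` and `window_for_thread(words, 1, 5)` yields `BCDDC`. Under this layout neighbouring threads read almost the same words, and each 5-character window touches at most two of them.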