Previously I’ve been trying to keep the data relevant to my computation in the L1 and re-use it locally, which in my case meant I could only run around 8-16 warp threads per GPU block/SM at a time – any more and the data just wouldn’t fit. This has yielded mediocre results. Part of me suspects I’ve been doing it wrong.
Instead, I’m considering running 1024 warp threads per block/SM and streaming the data from HBM.
My new algorithm is something like the following:
Each thread takes as input a 32-bit int, let’s call it `key`, and a 64-entry array of 32-bit ints. Each thread iterates through its 64-entry array until it either finds a match or finishes iterating.
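For concreteness, here is a minimal sketch of the kernel I have in mind (the names `find_key`, `keys`, `arrays`, and `out` are placeholders of my own, and I assume each thread’s 64-entry array is stored contiguously):

```
// Baseline layout: each thread owns a contiguous 64-entry array.
// Thread t's entries live at arrays[t * 64 + 0 .. t * 64 + 63].
__global__ void find_key(const int *keys, const int *arrays, int *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int key = keys[t];
    int found = -1;  // -1 means "no match found"
    for (int i = 0; i < 64; ++i) {
        if (arrays[t * 64 + i] == key) {
            found = i;
            break;
        }
    }
    out[t] = found;
}
```

This would be launched with 1024 threads per block, e.g. `find_key<<<numBlocks, 1024>>>(d_keys, d_arrays, d_out);` (device pointer names also placeholders).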
(1) Let’s assume there is an initial latency of 400+ cycles (a fictitious number) due to missing in both the L1 and the L2. Will the 63 reads that follow that initial miss be prefetched? What should I consider their latency to be? 400+ cycles? 3 cycles?
(2) Will the latency for the initial read of threads 32-63 be any lower than the latency of the initial read for threads 0-31? Given that threads 0-31 execute immediately before threads 32-63 (on the same block/SM), would a couple of cycles be saved?
(3) How long does it take for the HBM->register prefetcher (which may be an abstraction over multiple layers of prefetchers) to kick in? Does such a prefetcher exist?
(4) Would interleaving the data for each thread help? That is, instead of having 1024 different 64-int arrays (i.e., `int arrays[1024][64]`), should I have 64 1024-int arrays (i.e., `int arrays[64][1024]`)? (See the first sketch after this list.)
(5) Would interleaving the interleavings help? I.e., let’s assume that 32 bytes constitute a cache line. Should I store 8 threads’ worth of data adjacently? I think this would be implemented with something like the layout in the second sketch below.
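To make (4) concrete, the fully interleaved version would look something like this (still a sketch; for simplicity I assume a single 1024-thread block so that `threadIdx.x` is the thread index `t`):

```
// Layout for (4): fully interleaved, i.e. int arrays[64][1024] flattened.
// Entry i of thread t lives at arrays[i * 1024 + t], so on each loop
// iteration the 32 threads of a warp read 32 consecutive ints, which
// should coalesce into a handful of cache-line transactions.
__global__ void find_key_interleaved(const int *keys, const int *arrays, int *out)
{
    int t = threadIdx.x;  // assuming one 1024-thread block
    int key = keys[t];
    int found = -1;
    for (int i = 0; i < 64; ++i) {
        if (arrays[i * 1024 + t] == key) {
            found = i;
            break;
        }
    }
    out[t] = found;
}
```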
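And here is my guess at (5), under the assumption of 32-byte cache lines (8 ints per line): store the data as `int arrays[128][64][8]`, so each group of 8 threads shares one line per entry and keeps its 64 entries contiguous. The shape and indexing are my own guess, not something I’ve validated:

```
// Layout for (5): tile-interleaved, i.e. int arrays[128][64][8] flattened.
// Entry i of thread t lives at arrays[t/8][i][t%8]: the 8 threads of a
// group share one 32-byte line per entry, and a group's 64 entries
// (64 * 32 bytes) are stored contiguously.
__global__ void find_key_tiled(const int *keys, const int *arrays, int *out)
{
    int t = threadIdx.x;  // assuming one 1024-thread block
    int g = t / 8;        // 8-thread group index (0..127)
    int l = t % 8;        // lane within the group's cache line (0..7)
    int key = keys[t];
    int found = -1;
    for (int i = 0; i < 64; ++i) {
        if (arrays[(g * 64 + i) * 8 + l] == key) {
            found = i;
            break;
        }
    }
    out[t] = found;
}
```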