Streaming data for "one-time" use from the HBM on a V100

Previously I’ve been trying to keep data relevant to my computation in the L1 and re-use it locally, which in my case meant I could only run around 8-16 warp threads per GPU block/SM at a time; any more and the data just wouldn’t fit. This has yielded mediocre results. Part of me suspects I’ve been doing it wrong.

Instead, I’m considering running 1024 warp threads per block/SM and streaming the data from the HBM.

My new algorithm is something like the following:

Each thread takes as input a 32-bit int, let’s call it key, and a 64-entry array of 32-bit ints. Each thread iterates through its 64-entry array until it either finds a match or finishes iterating.
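To make the per-thread work concrete, here is a minimal sketch of what one thread does, written as plain C++ for illustration (in actual CUDA the thread would derive its row from `blockIdx`/`threadIdx`; the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// One thread's work: scan its 64-entry row for `key`.
// Returns the matching index, or -1 if nothing matches.
int find_key(const int32_t* row, int32_t key) {
    for (int i = 0; i < 64; ++i) {
        if (row[i] == key) return i;  // early exit on a match
    }
    return -1;
}
```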

(1) Let’s assume there is an initial latency of 400+ cycles (a fictitious number) due to missing in the L1 and L2. Will the 63 reads that follow that initial miss be prefetched? What should I consider their latency to be? 400+ cycles? 3 cycles?

(2) Will the latency for the initial read of threads 32-63 be any faster than the latency of the initial read for threads 0-31? Given threads 0-31 execute immediately before threads 32-63 (on the same block/SM), would a couple cycles be saved?

(3) How long does it take for the HBM->register prefetcher (which may be an abstraction for multiple layers of prefetchers) to kick in? Does such a prefetcher exist?

(4) Would interleaving the data for each thread help? That is, instead of having 1024 different arrays (i.e., int array[1024][64]), should I have 64 1024-int arrays (i.e., int array[64][1024])?

(5) Would interleaving the interleavings help? I.e., let’s assume that 32 bytes constitute a cache line. Should I store 8 threads’ worth of data adjacently? I think this would be implemented with int array[1024/8][64][8].
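For concreteness, here is how the three layouts from (4) and (5) map a (thread, entry) pair to a flat index — just index arithmetic, assuming 1024 threads, 64 entries each, and the 8-thread tile from (5):

```cpp
#include <cassert>

constexpr int T = 1024;  // threads per block
constexpr int E = 64;    // entries per thread
constexpr int G = 8;     // threads per tile in layout (5)

// (4a) int array[1024][64]: each thread's 64 entries are contiguous.
int idx_aos(int t, int e) { return t * E + e; }

// (4b) int array[64][1024]: entry e of all 1024 threads is contiguous.
int idx_soa(int t, int e) { return e * T + t; }

// (5)  int array[1024/8][64][8]: 8 threads' data tiled together.
int idx_tiled(int t, int e) { return (t / G) * E * G + e * G + (t % G); }
```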


There is no active prefetching for L1: it’s ineffective since L1 is too small for this (only 64KB-96KB of L1 per SM), particularly if one is almost at capacity and there are too many collisions. For L2, NVIDIA has some very limited prefetching mechanism, but the details are not exposed externally and it’s not guaranteed to work for any predefined access pattern. L1 mostly helps with repeated reads within a warp execution (it’s also write-through, so it’s of limited help for reductions: use shared memory or atomics for those).

In general, the GPU’s memory system is designed to hide memory latencies by running and switching between many threads/warps, not by minimizing the latency seen by a single thread/warp or even a couple of them. The warps can be executed in any order, and load/store requests are issued for the whole warp (usually 32 threads) in 32B fragments (out of the 128B cache lines). So the answer for [1-3] depends more on the warp scheduler, the order of execution, and the number of kernels running. Given 32-bit ints, that’s only 8 of them per load/store fragment. The latency without a context switch will likely be the same. Note that fetches are pipelined (just as on a CPU): the memory system can have many outstanding read requests, so fetches from the second warp of threads (32-63) will follow the first without waiting for the first to complete. That means that while the latency will be the same (the 400 cycles in your example), the second fetch will return within a few cycles of the first. The gap depends on the rate at which the “load” instructions are issued.
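A back-of-the-envelope model of that pipelining, using the fictitious 400-cycle latency from question (1) and an assumed issue rate of one warp-load per cycle (an illustration, not a measured V100 figure):

```cpp
#include <cassert>

// All numbers are illustrative: the 400-cycle latency is the
// fictitious figure from the question, and the one-load-per-cycle
// issue rate is an assumption, not a measured V100 value.
constexpr int kLatencyCycles = 400;  // fictitious HBM miss latency
constexpr int kIssueInterval = 1;    // assumed cycles between warp loads
constexpr int kWarps = 1024 / 32;    // warps in a 1024-thread block

// Little's law: requests that must be in flight to fully hide latency.
constexpr int kInFlight = kLatencyCycles / kIssueInterval;

// If the 32 warps issue their first loads back-to-back, the last
// warp's data returns only this many cycles after the first warp's:
constexpr int kReturnGap = (kWarps - 1) * kIssueInterval;
```

Each warp still waits the full 400 cycles for its own load, but the completions arrive nearly back-to-back, which is why streaming with many warps works at all.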

For [4-5], it’s the “array-of-structures / structure-of-arrays” question. For GPUs you always want SoA. You would be much better off with data[64][1024] to facilitate coalesced loads within a warp.
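To put numbers on the SoA advantage, a quick byte-span calculation for one warp’s load at a given entry, assuming 4-byte ints and the 128-byte cache lines mentioned above:

```cpp
#include <cassert>

constexpr int kWarpSize = 32;
constexpr int kElemBytes = 4;   // 32-bit ints
constexpr int kEntries = 64;    // entries per thread

// SoA data[64][1024]: at entry e, lanes 0..31 read consecutive ints,
// so the whole warp's load spans one 128-byte cache line.
constexpr int kSoaSpanBytes = kWarpSize * kElemBytes;

// AoS data[1024][64]: at entry e, consecutive lanes are 64 ints
// apart, so the warp touches 32 distinct cache lines per load.
constexpr int kAosLaneStride = kEntries * kElemBytes;  // bytes between lanes
constexpr int kAosLinesTouched = kWarpSize;
```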

Thanks @alexvk!