I came up with a way to hide the latency of global memory accesses. Each multiprocessor has 8192 32-bit registers and 16 KB of shared memory. The idea: copy data from global memory into registers, and while the threads work on the registers, copy the next batch of data from global memory into shared memory. When the threads finish, they switch to the shared memory and continue, and the process repeats. Is this kind of prefetching or double buffering possible on the GPU? Any suggestions? Thanks.
The GPU already hides memory latency by doing computation while waiting on memory loads. As I understand it (somebody please correct me if I'm wrong), each multiprocessor has one warp actively doing computational work at any one time, while the other warps allocated to the multiprocessor wait on memory loads. This is why the programming guide recommends having 192 threads (6 warps) active at any one time on each multiprocessor, and why increasing occupancy often reduces runtime. It is also why increasing occupancy can only do so much: once all the latency has been hidden, there is no benefit to adding more threads to the multiprocessor.
Well, in my case I can run only 16 threads per multiprocessor if shared memory is used. This is task-level parallelization, and each task needs about 1 KB of data. My strategy is to copy that data into shared memory, so max threads per multiprocessor = 16 KB / 1 KB = 16. A lot of registers go unused, and that is the motivation.
How about copying the data from global memory into both the registers and the shared memory, and switching between these two "buffers"?
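A minimal sketch of what I mean (the kernel name, the CHUNK size, and the indexing scheme are just illustrative, and the actual overlap depends on the compiler issuing the loads early, since a global-to-shared copy passes through a register on this hardware anyway):

```cuda
#define CHUNK 4  // floats per thread per buffer (assumed for illustration)

// Each thread alternates between a register buffer and a shared-memory
// buffer: it issues the global loads for the NEXT chunk before computing
// on the chunk it already holds, so loads can overlap with arithmetic.
__global__ void double_buffer(const float *g_in, float *g_out, int n_chunks)
{
    extern __shared__ float s_buf[];        // blockDim.x * CHUNK floats
    float r_buf[CHUNK];                     // register buffer
    int tid    = threadIdx.x;
    int stride = gridDim.x * blockDim.x * CHUNK;
    int base   = (blockIdx.x * blockDim.x + tid) * CHUNK;

    // Prime the register buffer with chunk 0.
    for (int i = 0; i < CHUNK; ++i)
        r_buf[i] = g_in[base + i];

    float acc = 0.0f;
    for (int c = 1; c < n_chunks; ++c) {
        // Issue loads for chunk c into shared memory ...
        for (int i = 0; i < CHUNK; ++i)
            s_buf[tid * CHUNK + i] = g_in[base + c * stride + i];
        // ... while computing on the chunk already in registers
        // (placeholder computation: a running sum).
        for (int i = 0; i < CHUNK; ++i)
            acc += r_buf[i];
        __syncthreads();                    // prefetched tile is complete
        for (int i = 0; i < CHUNK; ++i)     // swap: shared -> registers
            r_buf[i] = s_buf[tid * CHUNK + i];
        __syncthreads();                    // safe to overwrite the tile
    }
    for (int i = 0; i < CHUNK; ++i)         // last chunk
        acc += r_buf[i];
    g_out[blockIdx.x * blockDim.x + tid] = acc;
}
```

Launch with `double_buffer<<<grid, block, block.x * CHUNK * sizeof(float)>>>(...)` so the dynamic shared-memory buffer is sized to the block.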
I am doing something similar for chess algorithms. To hide memory latency, I use one calculation warp that only uses registers + shared memory, communicating with another warp that does memory prefetching and acts as a kind of write-through cache :-)
So 2 warps per MP = 64 threads => 256 registers per working thread (really nice to have for complex algorithms).
With 32 working threads, that is 512 bytes of shared memory allocated per thread.
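A rough sketch of that two-warp scheme (the kernel name, tile size, and handshake via barriers are my own assumptions, not the actual chess code): warp 0 only prefetches from global memory into shared memory, warp 1 only computes out of registers and shared memory.

```cuda
// Launch with 64 threads per block: warp 0 = prefetch, warp 1 = compute.
__global__ void warp_specialized(const float *g_in, float *g_out, int n_tiles)
{
    __shared__ float tile[32];          // one tile staged per iteration

    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    float acc = 0.0f;
    for (int t = 0; t < n_tiles; ++t) {
        if (warp == 0) {
            // Prefetch warp: its threads stall on the global load,
            // while the compute warp keeps the ALUs busy.
            tile[lane] = g_in[t * 32 + lane];
        }
        __syncthreads();                // tile visible to the compute warp
        if (warp == 1) {
            acc += tile[lane];          // compute warp touches on-chip only
        }
        __syncthreads();                // safe to overwrite the tile
    }
    if (warp == 1)
        g_out[lane] = acc;
}
```

With barriers at every tile, the two warps run in lockstep; a real implementation would double-buffer the tile so the prefetch warp can run one iteration ahead.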
I combine this with other techniques, such as macro-threading and micro-threading:
micro-threading is the idea of dispatching a pseudo-sequential task to many threads (e.g. 8 threads, 1 per pawn when examining a position)
macro-threading is the idea of creating, for example, 4 groups of threads (of 8 threads each in this case) in a warp, each group examining a position similar to the others (few divergences)
You end up parallelizing 90% of the work with micro-threading, processing 4 positions in parallel, with 256 registers per thread and 4 KB of shared memory per position for global memory I/O buffering :-)
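The micro/macro layout above might map onto lanes within one 32-thread warp like this (names are illustrative, not from the actual code): 4 macro groups per warp, each group of 8 micro-threads examining one position, one thread per pawn.

```cuda
// Decompose a warp lane into (position, pawn) coordinates:
// 4 macro groups x 8 micro-threads = 32 lanes per warp.
__device__ void thread_layout(int *position_id, int *pawn_id)
{
    int lane = threadIdx.x % 32;    // lane within the warp
    *position_id = lane / 8;        // macro: which position (0..3)
    *pawn_id     = lane % 8;        // micro: which pawn within it (0..7)
}
```

Because all 4 groups execute the same code on similar positions, the warp stays mostly convergent, which is what keeps the divergence penalty low.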
Reading the NVIDIA programming guide again, I found that Section 5.2 discusses the number of threads per block. It suggests 64 threads per block as the minimum, and that shared memory usage should be at most half the total amount (16 KB / 2 = 8 KB) so that at least two blocks can fit on a multiprocessor. That allows blocks that are waiting on memory to overlap with blocks that can run.
I then split the data into finer granularity, keeping some of it in global memory and some in the shared pool. Settings: 64 threads per block, 3 blocks per multiprocessor, 6 warps per multiprocessor, 37 registers per thread, 4 KB shared memory, 25% occupancy. This already gives a 1.3x speedup over my previous version.