To solve this, besides falling back to local memory (or global memory), one can a) use shared memory, b) unroll the loops, or c) use conditional code to statically reorder registers.
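As a rough illustration of options b) and c) combined, here is a minimal sketch. It assumes the underlying problem is a small per-thread array that is indexed with a runtime value. The names `N`, `select_static`, and `demo` are made up for this example. The idea is that after full unrolling every array subscript is a compile-time constant, so the compiler is free to keep the array in registers. Whether it actually does so depends on the compiler version and register pressure, so it is worth checking the `-Xptxas -v` output for local memory usage.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 8;  // hypothetical per-thread array size, chosen for illustration

// Return r[idx] for a runtime idx without ever indexing r dynamically.
// After unrolling, i is a constant in every comparison and every subscript.
__device__ __forceinline__ float select_static(const float (&r)[N], int idx)
{
    float v = 0.0f;
#pragma unroll
    for (int i = 0; i < N; ++i) {
        if (i == idx) v = r[i];  // typically compiled to a select, not a branch
    }
    return v;
}

__global__ void demo(const int *indices, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float r[N];  // intended to live in registers
#pragma unroll
    for (int i = 0; i < N; ++i) r[i] = 10.0f * i + tid;

    out[tid] = select_static(r, indices[tid]);
}

int main()
{
    const int threads = 32;
    int   h_idx[threads];
    float h_out[threads];
    for (int t = 0; t < threads; ++t) h_idx[t] = t % N;

    int   *d_idx = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_idx, sizeof(h_idx));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_idx, h_idx, sizeof(h_idx), cudaMemcpyHostToDevice);

    demo<<<1, threads>>>(d_idx, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Host-side check against the value each thread should have selected.
    bool ok = true;
    for (int t = 0; t < threads; ++t) {
        float expected = 10.0f * (t % N) + t;
        if (h_out[t] != expected) ok = false;
    }
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(d_idx);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

The cost is N comparisons per access instead of one indexed load, so this tends to pay off only for small arrays. For larger ones, option a) (a per-thread slice of shared memory) may be the better trade.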