Solid analysis, and we share similar thoughts. This is a learning experiment for me using ethash (lots of examples, a well-understood problem space, and easy-to-validate correctness) in preparation for porting proprietary CPU/SIMD software-defined-radio code.
The compile-time ptxas live-register analysis clearly differs from the final SASS (maybe due to the multiple SASS version possibilities). On my Turing device, the Nsight Compute occupancy graph moves in multiples of eight registers (consistent with the Turing architectural analysis I read). It seems the SASS might need only 80-81 live registers while ptxas estimates 89-90, which is then rounded up to 96 physical registers-- yikes. I assume the issue scales with kernel register requirements.
To your query about specific code, here is an interesting example (with build instructions) that demonstrates the underlying issue. This GitHub ethereum project (GitHub - ethereum-mining/ethminer: Ethereum miner with OpenCL, CUDA and stratum support) has solid performance and beats my custom code (to my dismay). As in the other GPU examples I have reviewed, the structure is: a per-thread serial keccak/SHA3 pass, followed (after distributing state) by the parallel-thread "memory hard" portion, followed (after consolidating state) by a per-thread keccak/SHA3 final test.
The main CUDA kernel (ethminer/dagger_shuffled.cuh at master · ethereum-mining/ethminer · GitHub) is relatively straightforward, though it contains some magic-number choices that I assume were trial and error to accommodate ptxas. Here is the mostly raw code with #defines replaced by literals. The bfe() asm makes zero difference, and the offset[p] array can be replaced by a local variable. The main interesting bits are the three loops and their step values.
for (uint32_t a = 0; a < 64; a += 4)
{
    int t = bfe(a, 2u, 3u); // (a >> 2) & 7
    for (uint32_t b = 0; b < 4; b++)
    {
        for (int p = 0; p < 4; p++)
        {
            offset[p] = fnv(init0[p] ^ (a + b), ((uint32_t*)&mix[p])[b]) % d_dag_size;
            offset[p] = SHFL(offset[p], t, 8);
            mix[p] = fnv4(mix[p], d_dag[offset[p]].uint4s[thread_id]);
        }
    }
}
A cleaner version with more explicit loop details:
#pragma nounroll
for (uint32_t a = 0; a < 64; a += 4)
{
    #pragma unroll
    for (uint32_t b = 0; b < 4; b++)
    {
        #pragma unroll
        for (int p = 0; p < 4; p++)
        {
            offset[p] = fnv(init0[p] ^ (a + b), ((uint32_t*)&mix[p])[b]) % d_dag_size;
            offset[p] = SHFL(offset[p], (a >> 2) & 7, 8);
            mix[p] = fnv4(mix[p], d_dag[offset[p]].uint4s[thread_id]);
        }
    }
}
By default, small loops are unrolled and large loops are not. On my Turing/7.5 setup, the non-pragma and pragma versions give identical performance. ptxas requires 80 used registers, which means 24 of 32 warps-- important since this code requires thread groups of 8. Full warp occupancy would be interesting, but that is impossible-- there is no way to go from 80 registers down to 64. The opposite is easily possible-- unroll the main loop. In a perfect world, that would eliminate the loop variable (along with some calculations that become constants), freeing a couple more registers, and eliminate the loop structure (understanding that the memory-intensive nature limits compute optimizations). It might add a few instruction cache misses (official documentation is sparse in this area), but I doubt it is material.
Change the main loop #pragma from "nounroll" to "unroll" and ptxas used registers jump from 80 to 90, while overall performance drops by 2%. The register bloat comes from the transition between the memory-lookup phase and the state-consolidation phase: the changed instruction mix lets the compiler issue register loads earlier to reduce some stalls (though with a negative payoff here).
The number one requirement of localized optimizations is not making things worse (or at least providing the ability to disable them locally, if that is possible). Because ethash is memory hard by design, the remaining performance comes down to table scraps from other small optimizations (or the lack thereof). Curious how this plays out for SDR, but I expect something similar: algorithmic optimization against the main bottleneck, then fighting for small gains from whatever is left.
I played with one idea that showed promise for limiting the scope of early register loads, but no joy. I created custom versions of shfl_sync and xor2 using asm volatile, hoping the latter would prevent ptxas/SASS from scheduling certain instructions earlier (effectively creating a barrier) and in turn reduce register pressure. It turned into a game of whack-a-mole: it influenced the register loads, but not predictably. I want to work with the compiler, not fight against it. (Apologies for the code formatting-- the forum software seems to be missing a literal/code format.)